How to Validate Data Consistency After Migration
This topic describes how to validate data consistency after you complete a data migration task using Data Online Migration.
Data Online Migration does not guarantee data consistency. You must validate data consistency.
During the validation, requesting data from the source or destination may incur request and traffic fees.
If data is being written to the source or destination during the validation, the validation results may be affected.
Validation Dimensions
After data migration, data consistency between the source and destination is not guaranteed, so you must validate the data yourself. Data validation involves three dimensions: file count consistency, file content consistency, and file metadata (partial) consistency. A complete validation covers all three dimensions.
File count consistency: The set of source files (S1) and the set of files migrated to the destination (S2) are consistent, which means no files are missing. This is a quantity-based dimension.
File content consistency: The content of each file is consistent between the source and destination, which means the content is not corrupted or disordered. This is a content-based dimension.
File metadata (partial) consistency: The metadata entries for each file are consistent between the source and destination, which means no metadata entries are missing or disordered. This is a metadata-based dimension. Data Online Migration supports the migration of only partial metadata. For more information, see the Notes and Limits sections in the relevant migration tutorials.
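The file count dimension can be illustrated with a small set comparison. This is a minimal sketch, not part of Data Online Migration: the function name compare_file_sets and the sample paths are hypothetical, and it assumes you have already obtained the two file lists as relative paths.

```python
def compare_file_sets(source_paths, dest_paths):
    """Compare the source set (S1) with the destination set (S2).

    Returns (missing, unexpected):
      missing    - paths present in the source but not in the destination
      unexpected - paths present in the destination but not in the source
    """
    s1 = set(source_paths)
    s2 = set(dest_paths)
    missing = sorted(s1 - s2)
    unexpected = sorted(s2 - s1)
    return missing, unexpected


if __name__ == "__main__":
    # Hypothetical file lists, for illustration only.
    missing, unexpected = compare_file_sets(
        ["a.txt", "dir/b.txt", "dir/c.txt"],
        ["a.txt", "dir/b.txt"],
    )
    print("missing:", missing)
    print("unexpected:", unexpected)
```

An empty missing list means that every source file has a counterpart at the destination. It says nothing about content or metadata, which are validated separately.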
Validation Logic
Different destination types offer different validation features. For example, OSS can generate a file list for a bucket and provides reliable CRC-64 values for all uploaded objects. A local file system (LocalFS) has no equivalent features, so you must list the files and calculate the checksums yourself.
The following describes the data consistency validation logic for different destination types:
Destination Type | Validation Dimension | Validation Logic |
OSS | File count consistency | |
OSS | File content consistency | |
OSS | File metadata (partial) consistency | |
LocalFS | File count consistency | |
LocalFS | File content consistency | |
LocalFS | File metadata (partial) consistency | |
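For a LocalFS source or destination, listing files and computing checksums yourself might look like the following sketch. The helper names (crc64, file_crc64, build_manifest) are hypothetical. The CRC-64 parameters used here (ECMA-182 polynomial in reflected form, initial value and final XOR of all ones) are an assumption. Confirm them against the OSS documentation before comparing the results with CRC-64 values reported by OSS. A pure-Python CRC is slow for large files, so treat this as an illustration of the logic rather than a production tool.

```python
import os

# Assumed CRC-64 parameters: ECMA-182 polynomial (reflected form),
# initial value and final XOR of all ones. Verify before relying on them.
_POLY = 0xC96C5795D7870F42
_MASK = 0xFFFFFFFFFFFFFFFF


def _build_table():
    """Precompute the 256-entry lookup table for the reflected polynomial."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def _update(crc, data):
    """Feed bytes into a running (not yet finalized) CRC state."""
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc


def crc64(data):
    """CRC-64 of an in-memory bytes object."""
    return _update(_MASK, data) ^ _MASK


def file_crc64(path, chunk_size=1024 * 1024):
    """CRC-64 of a file, read in chunks to keep memory use bounded."""
    crc = _MASK
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = _update(crc, chunk)
    return crc ^ _MASK


def build_manifest(root):
    """Map each file's relative path (with '/' separators) to (size, crc64)."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            manifest[rel] = (os.path.getsize(full), file_crc64(full))
    return manifest
```

The keys of two such manifests can be compared for file count consistency, and the (size, checksum) values for file content consistency.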
Cost and Performance Optimization Suggestions
Use a private network for validation: You can use a private network to request data from both the source and destination. This practice effectively reduces network latency and eliminates or lowers related costs.
Optimize the validation logic: When you validate file content consistency, first compare the file sizes. If the sizes are inconsistent, mark the validation as failed. If the sizes are consistent, proceed to retrieve the checksums.
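The size-first check described above can be sketched as follows. The function name content_matches and its callable parameters are hypothetical. The checksums are passed as zero-argument callables so that they are retrieved only when the sizes already agree.

```python
def content_matches(src_size, dst_size, get_src_checksum, get_dst_checksum):
    """Return True if a file's content appears consistent.

    get_src_checksum / get_dst_checksum are zero-argument callables that
    retrieve or compute the checksum. They are only invoked when the
    sizes match, which avoids unnecessary requests and reads.
    """
    if src_size != dst_size:
        # Cheap check failed: no need to fetch checksums.
        return False
    return get_src_checksum() == get_dst_checksum()


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    ok = content_matches(1024, 1024, lambda: 0xABCDEF, lambda: 0xABCDEF)
    print("consistent" if ok else "inconsistent")
```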
Use a sampling validation strategy: For scenarios that involve a massive amount of data, such as terabytes or petabytes of storage and hundreds of millions of files, a full validation is extremely costly. To keep reasonable confidence in the data at a lower cost, you can validate a sample that includes files with different characteristics. The following suggestions are provided:
Sample by file size:
Small file samples: Extract several files that are smaller than 150 MB. These files are typically uploaded using a simple upload method.
Large file samples: Extract several files that are larger than 150 MB. These files are typically uploaded in parts concurrently (multipart upload).
Extra-large file samples: If your files exceed 5 GB in size, sample them separately. These files typically have long transfer periods and consume high bandwidth.
Directory and empty file samples: These are edge cases. Confirm that they are successfully created and their metadata meets expectations, such as the Uid, Gid, and Permissions of directories.
Sample by business features and metadata:
Core business samples: If the migrated data includes core business files, perform a full validation on these files.
File type samples: Based on your business type, extract files of key formats, such as .jpg, .pdf, .log, .json, and .sql.
Special metadata samples: Based on your business type, if source files have custom metadata, such as x-oss-meta-* or specific HTTP headers, extract these files to validate the consistency of the custom metadata.
Hot and cold data samples: Extract recently updated files and historical archived objects. Validate whether metadata, such as LastModified, is accurately retained.
Path and naming samples: For example, you can sample files with deep directory levels (such as more than 10 levels) and files with spaces, Chinese characters, Unicode characters, or special symbols in their paths and names. Validate for encoding, decoding, and escape issues.
Random statistical sampling:
Use a random algorithm to extract a certain percentage of files, such as 1% to 5%, from the file list. This helps discover irregular random errors.
A sampling validation strategy is a compromise that balances validation cost and efficiency in scenarios that involve a massive amount of data. Sampling validation is based on statistical principles and carries a risk of missed detections. It cannot replace a full validation. For core business data, a full validation is still strongly recommended.
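The size-based and random sampling suggestions above could be combined as in the following sketch. The names size_bucket and pick_samples, the per_bucket default, and the handling of files that are exactly 150 MB (placed in the large bucket) are assumptions made for illustration. The input is assumed to be a list of (path, size in bytes) pairs taken from your file list.

```python
import random

MB = 1024 * 1024
GB = 1024 * MB
SMALL_LIMIT = 150 * MB   # below this: small files
XL_LIMIT = 5 * GB        # above this: extra-large files


def size_bucket(size):
    """Classify a file size into one of the sampling buckets."""
    if size == 0:
        return "empty"
    if size < SMALL_LIMIT:
        return "small"
    if size <= XL_LIMIT:
        return "large"
    return "extra_large"


def pick_samples(files, per_bucket=3, random_fraction=0.01, seed=None):
    """Pick paths to validate from a list of (path, size) pairs.

    Takes up to per_bucket files from every size bucket, then adds a
    random fraction of the whole list (at least one file if the list
    is not empty). Returns the selected paths in sorted order.
    """
    rng = random.Random(seed)
    buckets = {}
    for path, size in files:
        buckets.setdefault(size_bucket(size), []).append(path)

    chosen = set()
    for paths in buckets.values():
        chosen.update(rng.sample(paths, min(per_bucket, len(paths))))

    # Random statistical sample across the whole file list.
    count = max(1, round(len(files) * random_fraction)) if files else 0
    chosen.update(path for path, _size in rng.sample(files, count))
    return sorted(chosen)
```

Passing a fixed seed makes the selection reproducible, which helps when a validation run needs to be repeated on the same sample.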
Validation Issues and Troubleshooting
If you find data inconsistencies during the validation, perform the following steps to troubleshoot the issues:
Confirm whether the file was updated on the source or destination during or after the migration.
Retrieve all information about the file from the migration report. Compare the information with the source and destination information separately and analyze the possible causes.
If you cannot locate the issue, submit a ticket to contact technical support. In the ticket, specify information such as the console region, task ID, and file path.
Mark Task Status
After you complete the validation and confirm that no issues exist, log on to the Data Online Migration console. On the Task List page, find the task and select Confirm data integrity and consistency in the Status column to confirm that the validation is complete.