How to Validate Data Consistency After Migration

This topic describes how to validate data consistency after you complete a data migration task using Data Online Migration.

Important
  • Data Online Migration does not guarantee data consistency. You must validate data consistency yourself after the migration task is complete.

  • During the validation, requesting data from the source or destination may incur request and traffic fees.

  • If data is being written to the source or destination during the validation, the validation results may be affected.

Validation Dimensions

Because data consistency between the source and destination is not guaranteed after migration, you must validate the data yourself. Data validation involves three dimensions: file count consistency, file content consistency, and file metadata (partial) consistency. A complete validation covers all three dimensions.

  • File count consistency: Every file in the set of source files (S1) also exists in the set of files migrated to the destination (S2), which means no files are missing. This is a quantity-based dimension.

  • File content consistency: The content of each file is consistent between the source and destination, which means the content is not corrupted or disordered. This is a content-based dimension.

  • File metadata (partial) consistency: The metadata entries for each file are consistent between the source and destination, which means no metadata entries are missing or altered. This is a metadata-based dimension. Data Online Migration migrates only some metadata entries, so only those entries can be validated. For more information, see the Notes and Limits sections in the relevant migration tutorials.

Validation Logic

Different destination types offer different validation features. For example, OSS can generate a file list for a bucket and provides reliable CRC-64 values for all uploaded objects. A local file system (LocalFS) has no comparable features, so you must list the files and calculate the checksums yourself.

The following describes the data consistency validation logic, grouped first by destination type (OSS or LocalFS) and then by validation dimension:

Destination type: OSS

File count consistency

  • Validation principle: Traverse the source file collection and ensure each file exists in the destination collection.

  • Validation steps:

    • Get the source file list: Obtain it using the relevant traversal interface, or generate it using the list feature (if available).

    • Get the destination file list: Obtain it using the ListObjects interface, or generate it using the OSS list feature.

    • Perform validation: Traverse the source file list and ensure each file exists in the destination file list.
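
The file count check above reduces to a set difference. The following is a minimal sketch, assuming both file lists have already been exported to plain Python iterables of object keys. The function name and the two prefix parameters are hypothetical and are not part of any OSS or Data Online Migration API; they only illustrate how a source prefix can be mapped to a destination prefix before comparison.

```python
def find_missing_files(source_keys, destination_keys,
                       source_prefix="", destination_prefix=""):
    """Return the source keys that have no counterpart at the destination.

    Prefixes are stripped before comparison, so a task that migrated
    "data/" to "backup/" can still be compared key by key.
    """
    destination = {
        key[len(destination_prefix):]
        for key in destination_keys
        if key.startswith(destination_prefix)
    }
    missing = []
    for key in source_keys:
        if not key.startswith(source_prefix):
            continue  # outside the migrated scope (assumption: prefix-based scope)
        relative = key[len(source_prefix):]
        if relative not in destination:
            missing.append(key)
    return sorted(missing)
```

Extra files at the destination are ignored on purpose, because the principle above only requires that every source file exists at the destination.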

File content consistency

  • Validation principle: Compare the checksums of each file.

  • Validation steps:

    • Get the source CRC-64: If the source provides a CRC-64 value of the same type as the OSS value, obtain it directly through the source's interface. Otherwise, read each source file and calculate its CRC-64 value.

    • Get the destination CRC-64: The CRC-64 value for each successfully migrated object is listed in the migration report. You can also call the HeadObject interface to retrieve it from OSS in real time.

    • Perform validation: Compare the CRC-64 values obtained from both ends.
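
When the source does not provide a CRC-64 value, it has to be calculated from the file content. The following is a minimal pure-Python sketch, assuming the destination value uses the ECMA-182 polynomial in its bit-reflected form with an all-ones initial value and final XOR, and that the destination reports the value as an unsigned decimal string (for example, in a response header such as x-oss-hash-crc64ecma). Both points are assumptions to confirm against your destination. All function names are illustrative.

```python
CRC64_POLY_REFLECTED = 0xC96C5795D7870F42  # ECMA-182 polynomial, bit-reflected
MASK64 = 0xFFFFFFFFFFFFFFFF


def _build_table():
    """Precompute the CRC of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ CRC64_POLY_REFLECTED if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc64_update(crc, data):
    """Feed more bytes into a running CRC-64. Start with crc=0."""
    crc ^= MASK64
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ MASK64


def crc64_of_file(path, chunk_size=1024 * 1024):
    """Calculate the CRC-64 of a file without loading it fully into memory."""
    crc = 0
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            crc = crc64_update(crc, chunk)
    return crc


def crc64_matches(local_crc, reported_value):
    """Compare a calculated CRC-64 with a decimal string reported by the other side."""
    return local_crc == int(reported_value)
```

A pure-Python loop is slow for large files. For real workloads, a native CRC-64 implementation with the same parameters is likely a better fit.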

File metadata (partial) consistency

  • Validation principle: Compare the (partial) metadata entries of each file.

  • Validation steps:

    • Get the source file metadata: Obtain it using the relevant interface, such as S3's HeadObject.

    • Get the destination file metadata: Obtain it using the HeadObject interface.

    • Perform validation: Compare the two sets of metadata entries (only those supported by Data Online Migration) one by one.
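
The entry-by-entry comparison can be sketched as follows, assuming the metadata from both sides has already been fetched into dictionaries (for example, from HeadObject responses). The function name is hypothetical, and the list of compared keys must come from the Notes and Limits section of your migration tutorial; the keys used in the usage example are placeholders only.

```python
def diff_metadata(source_meta, destination_meta, compared_keys):
    """Return {key: (source_value, destination_value)} for every mismatch.

    Header names are compared case-insensitively. Keys outside
    compared_keys are ignored, because only some metadata is migrated.
    """
    source = {key.lower(): value for key, value in source_meta.items()}
    destination = {key.lower(): value for key, value in destination_meta.items()}
    differences = {}
    for key in compared_keys:
        key = key.lower()
        if source.get(key) != destination.get(key):
            differences[key] = (source.get(key), destination.get(key))
    return differences
```

For example, diff_metadata(src, dst, ["Content-Type", "Cache-Control"]) returns an empty dictionary when both entries match, and a missing entry shows up as None on the side where it is absent.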

Destination type: LocalFS

File count consistency

  • Validation principle: Traverse the source file collection and ensure each file exists in the destination collection.

  • Validation steps:

    • Get the source file list: Obtain it using the relevant traversal interface, or generate it using the list feature (if available).

    • Get the destination file list: LocalFS has no list feature, so you can obtain the list only by traversing the file system.

    • Perform validation: Traverse the source file list and ensure each file exists in the destination file list.
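
Traversing a LocalFS destination can be sketched with the standard library as follows. This is a minimal example, assuming the destination directory is mounted on the machine that runs the validation; the function name is illustrative.

```python
import os


def list_relative_files(root):
    """Return the sorted relative paths of all non-directory entries under root."""
    paths = []
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(directory, name)
            relative = os.path.relpath(full_path, root)
            # Use "/" as the separator so paths compare equal to object keys.
            paths.append(relative.replace(os.sep, "/"))
    return sorted(paths)
```

Empty directories do not appear in this list, so they need a separate check if the source contains them.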

File content consistency

  • Validation principle: Compare the checksums of each file.

  • Validation steps:

    • Get the source checksum: If the source provides a directly usable checksum (MD5, CRC32, CRC-64, and so on), obtain it directly through the source's interface. Otherwise, select a checksum algorithm in advance (MD5 is recommended), then read each source file and calculate its checksum with that algorithm.

    • Get the destination checksum: Because file systems do not store checksum metadata, the corresponding validation field in the migration report is unavailable. Read each LocalFS file and calculate its checksum with the same algorithm that you used for the source.

    • Perform validation: Compare the checksums obtained from both ends.
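
Calculating the checksum of a LocalFS file can be sketched with hashlib as follows, using MD5 as recommended above. The helper names are illustrative. The file is read in chunks so that large files do not have to fit in memory.

```python
import hashlib


def md5_of_chunks(chunks):
    """Return the hexadecimal MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def read_in_chunks(path, chunk_size=1024 * 1024):
    """Yield the content of a file piece by piece."""
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                return
            yield chunk


def md5_of_file(path):
    """Return the hexadecimal MD5 digest of the file at path."""
    return md5_of_chunks(read_in_chunks(path))
```

The same md5_of_chunks helper can be fed from a source download stream, which keeps the algorithm identical on both ends.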

File metadata (partial) consistency

  • Validation principle: Compare the (partial) metadata entries of each file.

  • Validation steps:

    • Get the source file metadata: Obtain it using the relevant interface, such as S3's HeadObject.

    • Get the destination file metadata: Obtain it using a POSIX file I/O interface, such as stat.

    • Perform validation: Compare the two sets of metadata entries (only those supported by Data Online Migration) one by one.
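
Reading the destination metadata through stat can be sketched as follows. The selection of fields is an assumption for illustration; compare only the entries that your migration tutorial lists as supported. The function name is hypothetical.

```python
import os
import stat


def local_metadata(path):
    """Collect commonly compared metadata of a local file or directory."""
    info = os.lstat(path)  # lstat describes a symlink itself, not its target
    return {
        "uid": info.st_uid,
        "gid": info.st_gid,
        "permissions": oct(stat.S_IMODE(info.st_mode)),  # for example "0o644"
        "mtime": int(info.st_mtime),  # seconds since the epoch
        "size": info.st_size,
        "is_dir": stat.S_ISDIR(info.st_mode),
    }
```

The source values usually arrive as strings (for example, in object metadata), so they may need to be converted to the same types before comparison.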

Cost and Performance Optimization Suggestions

  • Use a private network for validation: You can use a private network to request data from both the source and destination. This practice effectively reduces network latency and eliminates or lowers related costs.

  • Optimize the validation logic: When you validate file content consistency, first compare the file sizes. If the sizes differ, mark the file as failed without retrieving or calculating checksums. If the sizes match, proceed to retrieve the checksums.

  • Use a sampling validation strategy: For scenarios that involve a massive amount of data, such as terabytes or petabytes of storage and hundreds of millions of files, a full validation is extremely costly. In these scenarios, you can use a sampling strategy: include file samples with different characteristics so that you gain confidence in data consistency at a lower cost. The following suggestions are provided:

    • Sample by file size:

      • Small file samples: Extract several files that are smaller than 150 MB. These files are typically uploaded using simple upload.

      • Large file samples: Extract several files that are larger than 150 MB. These files are typically uploaded in parts, concurrently (multipart upload).

      • Extra-large file samples: If your files exceed 5 GB in size, sample them separately. These files typically have long transfer periods and consume high bandwidth.

      • Directory and empty file samples: These are edge cases. Confirm that they are successfully created and their metadata meets expectations, such as the Uid, Gid, and Permissions of directories.

    • Sample by business features and metadata:

      • Core business samples: If the migrated data includes core business files, perform a full validation on these files.

      • File type samples: Based on your business type, extract files of key formats, such as .jpg, .pdf, .log, .json, and .sql.

      • Special metadata samples: Based on your business type, if source files have custom metadata, such as x-oss-meta-* or specific HTTP headers, extract these files to validate the consistency of the custom metadata.

      • Hot and cold data samples: Extract recently updated files and historical archived objects. Validate whether metadata, such as LastModified, is accurately retained.

      • Path and naming samples: For example, sample files with deep directory levels (such as more than 10 levels) and files with spaces, Chinese characters, other Unicode characters, or special symbols in their paths and names. Check for encoding, decoding, and escaping issues.

    • Random statistical sampling:

      • Use a random algorithm to extract a certain percentage of files, such as 1% to 5%, from the file list. This helps discover sporadic errors that the targeted samples above can miss.
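
The size-first check and the size-based and random sampling suggestions above can be sketched together as follows. This is a minimal example under stated assumptions: the file list is available as (key, size in bytes) pairs, 150 MB and 5 GB are interpreted as binary units, and all names and default values are illustrative rather than part of Data Online Migration.

```python
import random

LARGE_THRESHOLD = 150 * 1024 * 1024     # 150 MB, read here as MiB (assumption)
EXTRA_LARGE_THRESHOLD = 5 * 1024 ** 3   # 5 GB, read here as GiB (assumption)


def contents_match(source_size, destination_size,
                   get_source_checksum, get_destination_checksum):
    """Compare sizes first; fetch checksums only when the sizes match."""
    if source_size != destination_size:
        return False
    return get_source_checksum() == get_destination_checksum()


def size_bucket(size):
    """Classify a file by size. A file of exactly 150 MB counts as small here."""
    if size == 0:
        return "empty"
    if size > EXTRA_LARGE_THRESHOLD:
        return "extra_large"
    if size > LARGE_THRESHOLD:
        return "large"
    return "small"


def pick_samples(files, per_bucket=3, random_fraction=0.01, seed=None):
    """Pick a few files per size bucket plus a random fraction of all files."""
    files = list(files)
    rng = random.Random(seed)
    buckets = {}
    for key, size in files:
        buckets.setdefault(size_bucket(size), []).append(key)
    chosen = set()
    for keys in buckets.values():
        chosen.update(rng.sample(keys, min(per_bucket, len(keys))))
    # Random statistical sample: at least one file when the list is not empty.
    random_count = 0
    if files:
        random_count = min(len(files), max(1, round(len(files) * random_fraction)))
    chosen.update(key for key, _size in rng.sample(files, random_count))
    return sorted(chosen)
```

The checksum arguments of contents_match are callables so that an expensive read or request is skipped entirely when the sizes already differ. Samples based on business features, file types, or path names need selection rules of their own and are not covered by this sketch.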

Important

A sampling validation strategy is a compromise that balances validation cost and efficiency in scenarios that involve a massive amount of data. Sampling validation is based on statistical principles and carries a risk of missed detections. It cannot replace a full validation. For core business data, a full validation is still strongly recommended.

Validation Issues and Troubleshooting

If you find data inconsistencies during the validation, perform the following steps to troubleshoot the issues:

  • Confirm whether the file was updated on the source or destination during or after the migration.

  • Retrieve all information about the file from the migration report. Compare the information with the source and destination information separately and analyze the possible causes.

  • If you cannot locate the issue, submit a ticket to contact technical support. In the ticket, specify information such as the console region, task ID, and file path.

Mark Task Status

After you complete the validation and confirm that no issues exist, log on to the Data Online Migration console. On the Task List page, find the task and select Confirm data integrity and consistency in the Status column to confirm that the validation is complete.