Overview of ossimport

ossimport is a tool for migrating data to Object Storage Service (OSS). You can deploy ossimport on a local server or an Elastic Compute Service (ECS) instance to migrate data from local storage or other cloud storage services to OSS. Alibaba Cloud also provides standardized Data Online Migration and cross-region replication services. These services offer a graphical user interface (GUI) to help you migrate data to the cloud and replicate data between OSS buckets.

Important
  • ossimport does not verify files after migration. It cannot guarantee the correctness or consistency of the migration results. After a migration task is complete, you must verify the data consistency between the source and destination.

    If you delete the source data before you verify data consistency, you are liable for any resulting data loss. For more information, see Data Transport Service Agreement.

  • To migrate data from third-party data sources, we recommend that you use Data Online Migration.

  • To perform near-real-time replication of OSS data between two different buckets, we recommend that you use the OSS cross-region replication feature.

Features

  • Supports a wide range of data sources, such as local files, Qiniu Kodo, Baidu BOS, Amazon S3, Azure Blob, UPYUN USS, Tencent Cloud COS, Kingsoft KS3, lists of HTTP or HTTPS URLs, and Alibaba Cloud OSS. It can also be extended to support other sources as needed.

  • Supports standalone and distributed modes. The standalone mode is simple to deploy and easy to use. The distributed mode is suitable for large-scale data migration.

    Note

    In standalone mode, you can migrate only one bucket at a time.

  • Supports resumable uploads.

  • Supports traffic shaping.

  • Supports migrating files modified after a specific time or files with a specific prefix.

  • Supports concurrent data downloads and uploads.

Billing

The ossimport tool is free of charge. In a public cloud environment, the data source may incur costs such as outbound traffic fees and request fees. The destination OSS bucket will incur request fees. If you enable and use transfer acceleration for OSS, transfer acceleration fees will also be incurred.

Usage notes

  • Migration speed

    The migration speed of ossimport depends on factors such as the read bandwidth of the source, local network bandwidth, and file sizes. Migrating files smaller than 200 KB might be slow because of high input/output operations per second (IOPS) usage.

  • Source data is in the Archive storage class

    When you migrate data using ossimport, if the source data is in the Archive storage class, you must restore the data before you start the migration.

  • Local staging

    When you migrate data using ossimport, the data stream is staged in the local memory before it is uploaded to the destination.

  • Source data retention

    When you migrate data using ossimport, it only performs read operations on the source data. It does not modify or delete the source data.

  • Other migration tools (ossutil)

    To migrate less than 30 TB of data, we recommend using ossutil, which is simple and convenient. You can use the -u (--update) and --snapshot-path options to perform incremental file migration. For more information, see cp.

Runtime environment

You can deploy ossimport on a Linux or Windows system that meets the following requirements:

  • Windows 7 or later

  • CentOS 6 or CentOS 7

  • Java 7 or Java 8

Important

Distributed deployment is not supported on Windows systems.

Choose a deployment mode

ossimport supports two deployment modes: standalone and distributed.

  • Standalone mode: To migrate less than 30 TB of data, we recommend that you deploy in standalone mode. You can deploy ossimport on any machine that can access the data you want to migrate and can access OSS.

  • Distributed mode: To migrate more than 30 TB of data, we recommend that you deploy in distributed mode. You can deploy ossimport on multiple machines that can access the data you want to migrate and can access OSS.

    Note

    If the amount of data to migrate is very large, you can save time by deploying ossimport on ECS instances in the same region as your OSS bucket. Then, connect the server that stores the source data to an Alibaba Cloud VPC network using a leased line. Migrating data from multiple ECS instances to OSS over the internal network significantly improves migration efficiency.

    You can also use ossimport to transfer data over the Internet. The transfer speed is affected by the bandwidth of your local server.

Standalone mode

The Master, Worker, Tracker, and Console modules run on a single machine and are packaged into ossimport2.jar. The system has only one Worker.

The following shows the file structure in standalone mode:

ossimport
├── bin
│ └── ossimport2.jar  # The JAR package that contains the Master, Worker, Tracker, and Console modules.
├── conf
│ ├── local_job.cfg   # The job configuration file for standalone mode.
│ └── sys.properties  # The configuration file for system parameters.
├── console.bat         # The command-line tool for Windows. You can use it to execute import tasks step by step.
├── console.sh          # The command-line tool for Linux. You can use it to execute import tasks step by step.
├── import.bat          # The one-click import script for Windows. It executes the data migration task configured in conf/local_job.cfg, including starting, migrating, verifying, and retrying the task.
├── import.sh           # The one-click import script for Linux. It executes the data migration task configured in conf/local_job.cfg, including starting, migrating, verifying, and retrying the task.
├── logs                # The log directory.
└── README.md           # The README file. We strongly recommend that you read this file before use.
  • The import.bat and import.sh files are one-click import scripts. You can run them directly after you modify the local_job.cfg file.

  • The console.bat and console.sh files are command-line tools that you can use to execute commands step by step.

  • Run scripts or commands in the ossimport directory, which is the same directory as the *.bat/*.sh files.

Distributed mode

ossimport uses a distributed architecture based on a master-worker model. The following shows the structure:

Master --------- Job --------- Console
    |
    |
   TaskTracker
    |_____________________
    |Task     | Task      | Task
    |         |           |
Worker      Worker      Worker

Component

Description

Master

Splits a job into tasks based on the data size and number of files. You can configure the data size and number of files in the sys.properties file. The following steps describe how a job is split into tasks:

  1. The Master traverses the local storage or other cloud storage to get a complete list of files to be migrated.

  2. The Master splits the complete file list into tasks based on the data size and number of files. Each task is responsible for migrating or verifying a portion of the files.

Worker

  • Migrates files and verifies data for tasks. A worker pulls specified files from the data source and uploads them to the specified directory in OSS. The data source and OSS destination are specified in the job.cfg or local_job.cfg file.

  • Worker data migration supports traffic shaping and a specified number of concurrent tasks. Configure these settings in the sys.properties file.

TaskTracker

Abbreviated as Tracker. It distributes tasks and tracks task statuses.

Console

Interacts with users, accepts commands, and displays results. It supports system administration commands such as deploy, start, and stop, and job management commands such as submit, retry, and clean.

Job

A data migration job submitted by a user. For a user, one job corresponds to one job.cfg configuration file.

Task

A job can be divided into multiple tasks based on the data size and number of files. Each task migrates a portion of the files. The smallest unit for splitting a job is a single file. A file is not split across multiple tasks.

In distributed mode, you can start multiple machines for data migration. You can start only one Worker on each machine. Tasks are distributed evenly among the Workers. Each Worker can execute multiple tasks.
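The splitting rule described above (a task is bounded by a file count and a data size, and a single file is never split across tasks) can be illustrated with a minimal Python sketch. The function name `split_into_tasks` and the greedy, in-order packing are assumptions made for illustration; this document does not describe the Master's actual algorithm.

```python
from typing import Iterable, List, Tuple

FileEntry = Tuple[str, int]  # (object key, size in bytes)


def split_into_tasks(
    files: Iterable[FileEntry],
    count_limit: int = 10000,      # mirrors taskObjectCountLimit
    size_limit: int = 1024 ** 3,   # mirrors taskObjectSizeLimit (1 GB)
) -> List[List[FileEntry]]:
    """Greedily pack files into tasks without ever splitting a file."""
    tasks: List[List[FileEntry]] = []
    current: List[FileEntry] = []
    current_size = 0
    for key, size in files:
        # Close the current task if adding this file would exceed a limit.
        would_overflow = (
            len(current) + 1 > count_limit or current_size + size > size_limit
        )
        if current and would_overflow:
            tasks.append(current)
            current, current_size = [], 0
        # A file larger than size_limit still gets a task of its own.
        current.append((key, size))
        current_size += size
    if current:
        tasks.append(current)
    return tasks
```

In this sketch, smaller limits produce more tasks, which gives the Tracker more units to spread across Workers.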

The following shows the file structure in distributed mode:

ossimport
├── bin
│ ├── console.jar     # The JAR package for the Console module.
│ ├── master.jar      # The JAR package for the Master module.
│ ├── tracker.jar     # The JAR package for the Tracker module.
│ └── worker.jar      # The JAR package for the Worker module.
├── conf
│ ├── job.cfg         # The job configuration file template.
│ ├── sys.properties  # The configuration file for system parameters.
│ └── workers         # The list of workers.
├── console.sh          # The command-line tool. Only Linux is currently supported.
├── logs                # The log directory.
└── README.md           # The README file. We strongly recommend that you read this file before use.

Configuration files

In standalone mode, there are two configuration files: sys.properties and local_job.cfg. In distributed mode, there are three configuration files: sys.properties, job.cfg, and workers. The configuration items in local_job.cfg and job.cfg are identical. They only differ in name. The workers file is specific to the distributed environment.

  • sys.properties: System runtime parameters

    Parameter

    Description

    Details

    workingDir

    Working directory

    The directory where the tool package is decompressed. Do not change this parameter in standalone mode. In distributed mode, the working directory must be the same on each machine.

    workerUser

    The SSH username for the worker machine

    • If `privateKeyFile` is configured, it is used with priority.

    • If `privateKeyFile` is not configured, `workerUser` and `workerPassword` are used.

    • Do not change this parameter in standalone mode.

    workerPassword

    The SSH password for the worker machine

    Do not change this parameter in standalone mode.

    privateKeyFile

    The path of the private key file

    • If an SSH channel is already established, you can specify this parameter. Otherwise, leave it empty.

    • If `privateKeyFile` is configured, it is used with priority.

    • If `privateKeyFile` is not configured, `workerUser` and `workerPassword` are used.

    • Do not change this parameter in standalone mode.

    sshPort

    The SSH port

    The default value is 22. Do not change this value unless necessary. Do not change this parameter in standalone mode.

    workerTaskThreadNum

    The maximum number of threads for a worker to execute tasks

    • This parameter depends on the machine's memory and network. The recommended value is 60.

    • You can increase this value for a physical machine, for example, to 150. If the network bandwidth is fully utilized, do not increase the value further.

    • If the network connection is poor, decrease this value to a number such as 30. This helps prevent request timeouts caused by network contention.

    workerMaxThroughput(KB/s)

    The maximum traffic throughput for data migration on a worker

    This value can be used for traffic shaping. The default value is 0, which means no traffic shaping.

    dispatcherThreadNum

    The number of threads for the tracker to dispatch tasks and check statuses

    The default value is usually sufficient. Do not change the default value unless necessary.

    workerAbortWhenUncatchedException

    Specifies whether to skip or terminate a task when an unknown error occurs

    By default, the task is skipped.

    workerRecordMd5

    Specifies whether to use the x-oss-meta-md5 metadata to record the MD5 hash of migrated files in OSS. By default, the MD5 hash is not recorded.

    This is used for MD5 validation of files.

  • job.cfg: Data migration task configuration. The configuration items in local_job.cfg and job.cfg are identical. They only differ in name.

    Parameter

    Description

    Details

    jobName

    The name of the task. String.

    • The unique identifier of the task. The name must match the regular expression `[a-zA-Z0-9_-]{4,128}`: 4 to 128 characters, each a letter, digit, underscore (_), or hyphen (-). You can submit multiple tasks with different names.

    • If you submit a task with the same name as an existing task, a message appears indicating that the task already exists. You cannot submit a task with the same name until the existing task is cleaned.
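The `jobName` rule can be checked before a job is submitted. The following is a minimal Python sketch; the helper name `is_valid_job_name` is hypothetical and is not part of ossimport.

```python
import re

# Pattern taken from the jobName description: 4 to 128 characters,
# each a letter, digit, underscore, or hyphen.
JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9_-]{4,128}")


def is_valid_job_name(name: str) -> bool:
    """Return True if the whole string satisfies the jobName pattern."""
    return JOB_NAME_PATTERN.fullmatch(name) is not None
```

`fullmatch` is used so that a name with a valid substring but invalid characters elsewhere, such as one containing a space, is rejected.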

    jobType

    The type of the task. String.

    • The default value is import.

    isIncremental

    Specifies whether to enable incremental migration mode. Boolean.

    • The default value is false.

    • If set to true, the system rescans for incremental data at the interval specified by `incrementalModeInterval` (in seconds) and migrates the incremental data to OSS.

    incrementalModeInterval

    The synchronization interval in incremental mode. Integer. Unit: seconds.

    This parameter is valid only when `isIncremental` is set to true. The minimum configurable interval is 900 seconds. We recommend that you do not set this value to less than 3600 seconds. This prevents excessive requests and extra system overhead.

    importSince

    Migrates data modified after this time. Integer. Unit: seconds.

    • This time is a UNIX timestamp, which is the number of seconds that have elapsed since 00:00:00 UTC on January 1, 1970. Obtain the timestamp by running the `date +%s` command.

    • The default value is 0, which means all data is migrated.

    srcType

    The type of the data source. String. This parameter is case-sensitive.

    The following types are supported:

    • local: Migrates data from local files to OSS. For this option, you only need to specify `srcPrefix`. You do not need to specify `srcAccessKey`, `srcSecretKey`, `srcDomain`, or `srcBucket`.

    • oss: Migrates data from one OSS bucket to another.

    • qiniu: Migrates data from Qiniu Kodo to OSS.

    • bos: Migrates data from Baidu BOS to OSS.

    • ks3: Migrates data from Kingsoft KS3 to OSS.

    • s3: Migrates data from Amazon S3 to OSS.

    • youpai: Migrates data from UPYUN USS to OSS.

    • http: Migrates data to OSS from a provided list of HTTP or HTTPS links.

    • cos: Migrates data from Tencent Cloud COS to OSS.

    • azure: Migrates data from Azure Blob to OSS.

    srcAccessKey

    The AccessKey ID of the source. String.

    • If `srcType` is set to oss, qiniu, bos, ks3, or s3, enter the AccessKey ID of the data source.

    • If `srcType` is set to local or http, you do not need to specify this parameter.

    • If `srcType` is set to youpai or azure, enter the username (AccountName).

    srcSecretKey

    The AccessKey secret of the source. String.

    • If `srcType` is set to oss, qiniu, bos, ks3, or s3, enter the AccessKey secret of the data source.

    • If `srcType` is set to local or http, you do not need to specify this parameter.

    • If `srcType` is set to youpai, enter the operator password.

    • If `srcType` is set to azure, enter the AccountKey.

    srcDomain

    The endpoint of the source

    • If `srcType` is set to local or http, you do not need to specify this parameter.

    • If `srcType` is set to oss, enter the endpoint obtained from the console. This is not a second-level domain name that includes a bucket prefix.

    • If `srcType` is set to qiniu, enter the endpoint of the corresponding bucket obtained from the Qiniu console.

    • If `srcType` is set to `bos`, enter the Baidu BOS endpoint, such as http://bj.bcebos.com or http://gz.bcebos.com.

    • If `srcType` is set to `ks3`, enter the Kingsoft KS3 endpoint, such as http://kss.ksyun.com, http://ks3-cn-beijing.ksyun.com, or http://ks3-us-west-1.ksyun.com.

    • If `srcType` is set to `s3`, enter the endpoint for the Amazon S3 region.

    • If `srcType` is set to youpai, enter the UPYUN endpoint. For example, use http://v0.api.upyun.com to automatically determine the optimal line, http://v1.api.upyun.com for the China Telecom or China Netcom line, http://v2.api.upyun.com for the China Unicom or China Netcom line, or http://v3.api.upyun.com for the China Mobile or China Tietong line.

    • If `srcType` is set to cos, enter the region where the Tencent Cloud bucket is located. For example, the South China region is `ap-guangzhou`.

    • If `srcType` is set to azure, enter the `EndpointSuffix` value of the Azure Blob connection string, such as core.chinacloudapi.cn.

    srcBucket

    The name of the source bucket or container

    • If `srcType` is set to local or http, you do not need to specify this parameter.

    • If `srcType` is set to azure, enter the container name.

    • For other source types, enter the bucket name.

    srcPrefix

    The prefix of the source objects. String. Default: empty.

    • If `srcType` is set to `local`, enter the local directory. You must provide the full path, use single forward slashes (/) as separators, and end the path with a single forward slash (/). Only formats such as c:/example/ or /data/example/ are supported.

      Important

      Paths such as `c:/example//`, `/data//example/`, or `/data/example//` are invalid.

    • If `srcType` is set to oss, qiniu, bos, ks3, youpai, or s3, enter the prefix of the objects to be synchronized, excluding the bucket name. For example, data/to/oss/.

    • To synchronize all files, leave `srcPrefix` empty.
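The path rules for a local `srcPrefix` (full path, single forward slashes as separators, exactly one trailing slash) can be checked with a short Python sketch. The helper name and the regular expression are assumptions made for illustration; ossimport's own validation may differ.

```python
import re

# An optional Windows drive ("c:"), a leading "/", then zero or more
# non-empty segments that each end with exactly one "/".
_LOCAL_PREFIX = re.compile(r"(?:[A-Za-z]:)?/(?:[^/\\]+/)*")


def is_valid_local_src_prefix(prefix: str) -> bool:
    """Return True for paths shaped like c:/example/ or /data/example/."""
    return _LOCAL_PREFIX.fullmatch(prefix) is not None
```

Because every segment must be non-empty, doubled slashes such as `/data//example/` fail the check, as do relative paths, backslash separators, and paths without a trailing slash.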

    destAccessKey

    The AccessKey ID of the destination. String.

    The AccessKey ID used to access the OSS service. To view it, go to the Alibaba Cloud Management Console.

    destSecretKey

    The AccessKey secret of the destination. String.

    The AccessKey secret used to access OSS. To view it, go to the Alibaba Cloud Management Console.

    destDomain

    The endpoint of the destination. String.

    Obtain it from the Alibaba Cloud Management Console. This is not a second-level domain name that includes a bucket prefix. For a list of endpoints, see Domain Names.

    destBucket

    The destination bucket. String.

    The name of the OSS bucket. It does not need to end with a forward slash (/).

    destPrefix

    The prefix for destination objects. String. Default: empty.

    • By default, this parameter is empty, and objects are placed directly in the destination bucket.

    • If you want to synchronize data to a specific directory in OSS, end the prefix with a forward slash (/), such as data/in/oss/.

    • Note that OSS does not support object names that start with a forward slash (/). Therefore, do not configure `destPrefix` to start with a forward slash (/).

    • A local file with the path srcPrefix+relativePath is migrated to OSS with the path destDomain/destBucket/destPrefix+relativePath.

    • A cloud file with the path srcDomain/srcBucket/srcPrefix+relativePath is migrated to OSS with the path destDomain/destBucket/destPrefix+relativePath.
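The mapping from a source path to a destination object name can be sketched in Python as follows. The function name `dest_object_key` is hypothetical; the sketch only restates the srcPrefix+relativePath to destPrefix+relativePath rule and the restriction that `destPrefix` must not start with a forward slash.

```python
def dest_object_key(source_path: str, src_prefix: str, dest_prefix: str) -> str:
    """Map srcPrefix+relativePath to destPrefix+relativePath."""
    if dest_prefix.startswith("/"):
        raise ValueError("destPrefix must not start with '/'")
    if not source_path.startswith(src_prefix):
        raise ValueError("source path is not under srcPrefix")
    # Everything after srcPrefix is the relative path that is preserved.
    relative_path = source_path[len(src_prefix):]
    return dest_prefix + relative_path
```

For example, with `srcPrefix` set to `/data/example/` and `destPrefix` set to `data/in/oss/`, the file `/data/example/a/b.txt` maps to the object name `data/in/oss/a/b.txt` in the destination bucket.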

    srcSignatureVersion

    The signature version of the source.

    This parameter applies only to CloudBox.

    Value: `oss_signature_v4`, which indicates OSS V4 signature.

    destSignatureVersion

    The signature version of the destination.

    This parameter applies only to CloudBox.

    Value: `oss_signature_v4`, which indicates OSS V4 signature.

    srcRegion

    The region of the source.

    This parameter applies only to CloudBox.

    destRegion

    The region of the destination.

    This parameter applies only to CloudBox.

    srcCloudBoxId

    The ID of the source CloudBox.

    This parameter applies only to CloudBox.

    destCloudBoxId

    The ID of the destination CloudBox.

    This parameter applies only to CloudBox.

    taskObjectCountLimit

    The maximum number of files for each task. Integer. Default: 10000.

    This parameter affects the degree of parallelism for task execution. A general formula for this value is: (Total number of files) / (Total number of Workers) / (`workerTaskThreadNum`). The maximum value is 50000. If you do not know the total number of files, use the default value.

    taskObjectSizeLimit

    The maximum data size for each task. Integer. Unit: bytes. Default: 1 GB.

    This parameter affects the degree of parallelism for task execution. A general formula for this value is: (Total data size) / (Total number of Workers) / (`workerTaskThreadNum`). If you do not know the total data size, use the default value.
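The two formulas for `taskObjectCountLimit` and `taskObjectSizeLimit` can be worked through in a small Python sketch. The function name `suggest_task_limits`, the rounding up, and the lower bound of 1 are assumptions for illustration; only the division by the number of Workers and `workerTaskThreadNum` and the cap of 50000 come from the descriptions above.

```python
from typing import Tuple

MAX_TASK_OBJECT_COUNT = 50000  # documented maximum for taskObjectCountLimit


def _ceil_div(numerator: int, denominator: int) -> int:
    """Integer division that rounds up."""
    return (numerator + denominator - 1) // denominator


def suggest_task_limits(
    total_files: int,
    total_bytes: int,
    workers: int,
    worker_task_thread_num: int = 60,  # recommended workerTaskThreadNum
) -> Tuple[int, int]:
    """Return (taskObjectCountLimit, taskObjectSizeLimit in bytes)."""
    slots = workers * worker_task_thread_num
    count_limit = min(MAX_TASK_OBJECT_COUNT, max(1, _ceil_div(total_files, slots)))
    size_limit = max(1, _ceil_div(total_bytes, slots))
    return count_limit, size_limit
```

For example, 6,000,000 files totaling 240 GB on 4 Workers with 60 threads each gives 240 slots, so the sketch suggests 25000 files and 1 GB (1073741824 bytes) per task.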

    isSkipExistFile

    Specifies whether to skip existing files during data migration. Boolean.

    If set to `true`, the system determines whether to skip a file based on its size and `LastModifiedTime`. If set to `false`, it always overwrites existing files in OSS. The default value is `false`.

    scanThreadCount

    The number of threads for concurrent file scanning. Integer.

    • Default value: 1

    • Valid values: 1 to 32

    This parameter affects the efficiency of file scanning. Do not change it unless you have specific requirements.

    maxMultiThreadScanDepth

    The maximum depth of directories for concurrent scanning. Integer.

    • Default value: 1

    • Valid values: 1 to 16

    • The default value of 1 means concurrent scanning between top-level directories.

    • Do not change this value unless you have specific requirements. An excessively large value may cause the task to fail.

    appId

    The app ID for Tencent Cloud COS. Integer.

    This parameter is valid only when `srcType` is set to `cos`.

    httpListFilePath

    The absolute path of the file that contains a list of HTTP URLs. String.

    • This parameter is valid only when `srcType` is set to `http`. When the source is a list of HTTP URLs, you must provide the absolute path to the file containing the HTTP URLs, such as `c:/example/http.list`.

    • The HTTP links in this file must be in two columns, separated by one or more spaces. The columns represent the source URL prefix and the destination relative path in OSS. For example, if the `c:/example/http.list` file contains:

      http://xxx.xxx.com/aa/  bb.jpg
      http://xxx.xxx.com/cc/  dd.jpg

      If you specify `destPrefix` as `ee/`, the files migrated to OSS will be named as follows:

      ee/bb.jpg
      ee/dd.jpg
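The two-column format of the HTTP list file can be illustrated with a short Python sketch that computes the destination object names. The function name `parse_http_list` is hypothetical, and skipping blank lines is an assumption; the column meaning and the `destPrefix` handling follow the example above.

```python
from typing import List, Tuple


def parse_http_list(text: str, dest_prefix: str = "") -> List[Tuple[str, str]]:
    """Return (source URL prefix, destination object name) pairs."""
    entries: List[Tuple[str, str]] = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        columns = line.split()  # columns are separated by one or more spaces
        if len(columns) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 columns, got {len(columns)}"
            )
        url_prefix, relative_path = columns
        entries.append((url_prefix, dest_prefix + relative_path))
    return entries
```

Applied to the two sample lines with `destPrefix` set to `ee/`, the sketch yields the object names `ee/bb.jpg` and `ee/dd.jpg`.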
  • workers: Specific to distributed mode. Enter one IP address per line. For example:

    192.168.1.6
    192.168.1.7
    192.168.1.8
    • In the configuration above, the first line, 192.168.1.6, must be the master. This means that the Master, Worker, and TaskTracker will all start on 192.168.1.6. The Console must also run on this machine.

    • Ensure that multiple Worker machines have the same username, logon method, and working directory.
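The layout of the workers file (one IP address per line, with the first line acting as the master) can be checked with a minimal Python sketch before deployment. The function name `parse_workers` is hypothetical, and ignoring blank lines is an assumption; the standard-library `ipaddress` module is used only to reject lines that are not IP addresses.

```python
import ipaddress
from typing import List, Tuple


def parse_workers(text: str) -> Tuple[str, List[str]]:
    """Return (master host, all worker hosts) from workers file content."""
    hosts = [line.strip() for line in text.splitlines() if line.strip()]
    if not hosts:
        raise ValueError("workers file is empty")
    for host in hosts:
        ipaddress.ip_address(host)  # raises ValueError for a malformed address
    # The first host runs the Master, TaskTracker, and a Worker.
    return hosts[0], hosts
```

For the three addresses shown above, the sketch returns 192.168.1.6 as the master and all three addresses as Worker hosts.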

Configuration file examples

The following list provides configuration files for data migration tasks in distributed mode. The configuration file for standalone mode is named local_job.cfg, and its configuration items are the same as those for distributed mode.

  • Migrate data from a local server to OSS. Configuration file: job.cfg. `srcPrefix` must be an absolute path that ends with a forward slash (/), such as D:/work/oss/data/ or /home/user/work/oss/data/.

  • Migrate data from Qiniu Kodo to OSS. Configuration file: job.cfg. `srcPrefix` and `destPrefix` can be empty. If they are not empty, they must end with a forward slash (/), such as destPrefix=docs/.

  • Migrate data from Baidu BOS to OSS. Configuration file: job.cfg. `srcPrefix` and `destPrefix` can be empty. If they are not empty, they must end with a forward slash (/), such as destPrefix=docs/.

  • Migrate data from Amazon S3 to OSS. Configuration file: job.cfg. For more information, see S3 endpoints.

  • Migrate data from UPYUN USS to OSS. Configuration file: job.cfg. Set `srcAccessKey` and `srcSecretKey` to the operator account and password.

  • Migrate data from Tencent Cloud COS to OSS. Configuration file: job.cfg. Set `srcDomain` based on the V4 version requirement, for example, srcDomain=sh. `srcPrefix` can be empty. If it is not empty, it must start and end with a forward slash (/), such as srcPrefix=/docs/.

  • Migrate data from Azure Blob to OSS. Configuration file: job.cfg. Set `srcAccessKey` and `srcSecretKey` to the storage account and key. Set `srcDomain` to the `EndpointSuffix` in the connection string, such as core.chinacloudapi.cn.

  • Migrate data from OSS to OSS. Configuration file: job.cfg. This is suitable for data migration between different regions, between different storage classes, or between different prefixes. We recommend deploying on an ECS instance and using an internal endpoint to reduce traffic costs.
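As an illustration of how the parameters fit together for a local-to-OSS migration, the following Python sketch renders job.cfg content as key=value lines. The key=value layout is inferred from examples such as destPrefix=docs/ in this document, the helper name `render_job_cfg` is hypothetical, and every value in angle brackets is a placeholder that you must replace with your own.

```python
from typing import Dict


def render_job_cfg(settings: Dict[str, str]) -> str:
    """Render settings as key=value lines, one parameter per line."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


# Hypothetical settings for migrating a local directory to OSS.
local_to_oss = {
    "jobName": "local_test",
    "jobType": "import",
    "isIncremental": "false",
    "srcType": "local",
    "srcPrefix": "/home/user/work/oss/data/",   # absolute path, trailing slash
    "destAccessKey": "<your-AccessKey-ID>",
    "destSecretKey": "<your-AccessKey-secret>",
    "destDomain": "<your-OSS-endpoint>",
    "destBucket": "<your-bucket-name>",
    "destPrefix": "docs/",                      # no leading slash
}

job_cfg_text = render_job_cfg(local_to_oss)
```

Because `srcType` is `local`, the sketch omits `srcAccessKey`, `srcSecretKey`, `srcDomain`, and `srcBucket`, which are not required for local sources.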

Advanced settings

  • Time-based throttling

    In sys.properties, workerMaxThroughput(KB/s) represents the upper limit of a Worker's traffic. If your business requires traffic shaping, such as for source-side throttling or network limitations, set this parameter. The value of this parameter should be less than the machine's maximum network traffic and should be evaluated based on business needs. After changing the value, you must restart the service for it to take effect.

    In a distributed deployment, you need to modify the sys.properties file in the $OSS_IMPORT_WORK_DIR/conf directory of each Worker, and then restart the service.

    To implement time-based throttling, you can use crontab to schedule modifications to sys.properties and then restart the service for the changes to take effect.
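The scheduled modification of sys.properties can be done with a small script that crontab runs at the times you choose. The following Python sketch assumes that sys.properties stores the setting as a `workerMaxThroughput(KB/s)=<value>` line; the function names are hypothetical, and the script does not restart the service, which you still need to do for the change to take effect.

```python
import re
from pathlib import Path

THROUGHPUT_KEY = "workerMaxThroughput(KB/s)"

# Matches the whole "key = value" line and captures everything up to "=".
_THROUGHPUT_LINE = re.compile(
    r"^([ \t]*" + re.escape(THROUGHPUT_KEY) + r"[ \t]*=).*$", re.MULTILINE
)


def set_max_throughput(properties_text: str, kb_per_second: int) -> str:
    """Return the properties text with the throughput limit replaced."""
    if kb_per_second < 0:
        raise ValueError("throughput must be 0 (unlimited) or positive")
    new_text, replaced = _THROUGHPUT_LINE.subn(
        lambda match: match.group(1) + str(kb_per_second), properties_text
    )
    if replaced == 0:
        raise KeyError(f"{THROUGHPUT_KEY} not found")
    return new_text


def update_file(path: str, kb_per_second: int) -> None:
    """Rewrite a sys.properties file in place with the new limit."""
    properties = Path(path)
    properties.write_text(set_max_throughput(properties.read_text(), kb_per_second))
```

One possible arrangement is two crontab entries, one that sets a low limit at the start of business hours and one that sets 0 (no traffic shaping) at night, each followed by a service restart. In a distributed deployment, the file under $OSS_IMPORT_WORK_DIR/conf on every Worker would need the same update.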

  • Change task concurrency

    • In sys.properties, workerTaskThreadNum represents the number of tasks a Worker can execute concurrently. If the network connection is poor and concurrency is high, many timeout errors may occur. In this case, you can change this parameter to reduce the concurrency level and then restart the service.

    • In sys.properties, workerMaxThroughput(KB/s) represents the upper limit of a Worker's traffic. If your business requires traffic shaping, such as for source-side throttling or network limitations, set this parameter. The value of this parameter should be less than the machine's maximum network traffic and should be evaluated based on business needs.

    • In job.cfg, taskObjectCountLimit is the maximum number of files per task, with a default of 10000. This parameter affects the number of tasks. A number that is too small cannot achieve effective concurrency.

    • In job.cfg, taskObjectSizeLimit is the maximum data size per task, with a default of 1 GB. This parameter affects the number of tasks. A number that is too small cannot achieve effective concurrency.

      Important
      • Configure all parameters before you start the migration task.

      • After you change parameters in `sys.properties`, you must restart the migration server for the changes to take effect.

      • After a `job.cfg` task is submitted, its configuration parameters cannot be changed.

  • Verify data without migration

    ossimport supports verifying data without migrating it. To do this, set the jobType configuration item in the job.cfg or local_job.cfg file to audit instead of import. Other configurations are the same as for data migration.

  • Incremental data migration mode

    In incremental data migration mode, the task first performs a full migration. Then, it automatically performs incremental data migrations at the specified interval. The initial full migration starts immediately after the task is submitted. Subsequent incremental migrations run once per cycle. This mode is suitable for data backup and data synchronization.

    There are two configuration items for incremental mode:

    • In job.cfg, isIncremental specifies whether to enable incremental migration mode. A value of true enables incremental mode, and false disables it. The default value is false.

    • In job.cfg, incrementalModeInterval specifies the synchronization interval in incremental mode, which is the period between incremental data migrations, in seconds. This is valid only when isIncremental=true. The minimum configurable value is 900 seconds. We recommend that you do not set this value to less than 3600 seconds. This prevents excessive requests and extra system overhead.

  • Specify filter conditions for migration

    Filter conditions allow you to migrate only files that meet specific criteria. ossimport supports filtering by prefix and last modified time:

    • In job.cfg, srcPrefix is used to specify the prefix of the files to migrate. It is empty by default.

      • If srcType=local, enter the local directory. You must provide the full path, use single forward slashes (/) as separators, and end the path with a single forward slash (/), such as c:/example/ or /data/example/.

      • If srcType is oss, qiniu, bos, ks3, youpai, or s3, this is the prefix of the objects to be synchronized, excluding the bucket name. For example, data/to/oss/. To migrate all files, set srcPrefix to empty.

    • In job.cfg, importSince is used to specify the last modified time of the files to migrate, in seconds. importSince is a UNIX timestamp, which is the number of seconds that have elapsed since 00:00:00 UTC on January 1, 1970. Obtain the timestamp by running the date +%s command. The default value is 0, which means all data is migrated. In incremental mode, this parameter applies only to the first full migration. In non-incremental mode, it applies to the entire migration task.

      • If a file's `LastModifiedTime` is before importSince, the file is not migrated.

      • If a file's `LastModifiedTime` is after importSince, the file is migrated.
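As an alternative to `date +%s`, the importSince value for a specific point in time can be computed in Python, and the filter rule can be expressed as a comparison. Both function names are hypothetical. The document does not say how a file modified exactly at importSince is treated, so the strict comparison below is an assumption.

```python
from datetime import datetime, timezone


def to_import_since(moment: datetime) -> int:
    """Convert a timezone-aware datetime to a UNIX timestamp in seconds."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid ambiguity")
    return int(moment.timestamp())


def should_migrate(last_modified: int, import_since: int = 0) -> bool:
    """Return True if the file was modified after importSince."""
    # Assumption: a file modified exactly at importSince is not migrated.
    return last_modified > import_since


# 00:00:00 UTC on January 1, 2024
cutoff = to_import_since(datetime(2024, 1, 1, tzinfo=timezone.utc))
```

Here `cutoff` is 1704067200, which could be written to job.cfg as the importSince value to migrate only files modified after the start of 2024 (UTC).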