Overview of ossimport
ossimport is a tool for migrating data to Object Storage Service (OSS). You can deploy ossimport on a local server or an Elastic Compute Service (ECS) instance to migrate data from local storage or other cloud storage services to OSS. Alibaba Cloud also provides standardized Data Online Migration and cross-region replication services. These services offer a graphical user interface (GUI) to help you migrate data to the cloud and replicate data between OSS buckets.
ossimport does not verify files after migration. It cannot guarantee the correctness or consistency of the migration results. After a migration task is complete, you must verify the data consistency between the source and destination.
If you delete the source data before you verify data consistency, you are liable for any resulting data loss. For more information, see Data Transport Service Agreement.
To migrate data from other third-party data sources, we recommend that you use Data Online Migration.
To perform near-real-time replication of OSS data between two different buckets, we recommend that you use the OSS cross-region replication feature.
Features
Supports a wide range of data sources, such as local files, Qiniu Kodo, Baidu BOS, Amazon S3, Azure Blob, UPYUN USS, Tencent Cloud COS, Kingsoft KS3, lists of HTTP or HTTPS URLs, and Alibaba Cloud OSS. It can also be extended to support other sources as needed.
Supports standalone and distributed modes. The standalone mode is simple to deploy and easy to use. The distributed mode is suitable for large-scale data migration.
Note: In standalone mode, you can migrate only one bucket at a time.
Supports resumable uploads.
Supports traffic shaping.
Supports migrating files modified after a specific time or files with a specific prefix.
Supports concurrent data downloads and uploads.
Billing
The ossimport tool is free of charge. In a public cloud environment, the data source may incur costs such as outbound traffic fees and request fees. The destination OSS bucket will incur request fees. If you enable and use transfer acceleration for OSS, transfer acceleration fees will also be incurred.
Usage notes
Migration speed
The migration speed of ossimport depends on factors such as the read bandwidth of the source, local network bandwidth, and file sizes. Migrating files smaller than 200 KB might be slow because of high input/output operations per second (IOPS) usage.
Source data is in the Archive storage class
When you migrate data using ossimport, if the source data is in the Archive storage class, you must restore the data before you start the migration.
Local staging
When you migrate data using ossimport, the data stream is staged in the local memory before it is uploaded to the destination.
Source data retention
When you migrate data using ossimport, it only performs read operations on the source data. It does not modify or delete the source data.
Other migration tools (ossutil)
To migrate less than 30 TB of data, we recommend using ossutil. This tool is simple and convenient. You can use the -u, --update and --snapshot-path options to perform incremental file migration. For more information, see cp.
Runtime environment
You can deploy ossimport on a Linux or Windows system that meets the following requirements:
Windows 7 or later
CentOS 6 or CentOS 7
Java 7 or Java 8
Distributed deployment is not supported on Windows systems.
Choose a deployment mode
ossimport supports two deployment modes: standalone and distributed.
Standalone mode: To migrate less than 30 TB of data, we recommend that you deploy in standalone mode. You can deploy ossimport on any machine that can access the data you want to migrate and can access OSS.
Distributed mode: To migrate more than 30 TB of data, we recommend that you deploy in distributed mode. You can deploy ossimport on multiple machines that can access the data you want to migrate and can access OSS.
Note: If the amount of data to migrate is very large, you can save time by deploying ossimport on ECS instances in the same region as your OSS bucket. Then, connect the server that stores the source data to an Alibaba Cloud VPC network using a leased line. Migrating data from multiple ECS instances to OSS over the internal network significantly improves migration efficiency.
You can also use ossimport to transfer data over the Internet. The transfer speed is affected by the bandwidth of your local server.
Standalone mode
The Master, Worker, Tracker, and Console modules run on a single machine and are packaged into ossimport2.jar. The system has only one Worker.
The following shows the file structure in standalone mode:
ossimport
├── bin
│ └── ossimport2.jar # The JAR package that contains the Master, Worker, Tracker, and Console modules.
├── conf
│ ├── local_job.cfg # The job configuration file for standalone mode.
│ └── sys.properties # The configuration file for system parameters.
├── console.bat # The command-line tool for Windows. You can use it to execute import tasks step by step.
├── console.sh # The command-line tool for Linux. You can use it to execute import tasks step by step.
├── import.bat # The one-click import script for Windows. It executes the data migration task configured in conf/local_job.cfg, including starting, migrating, verifying, and retrying the task.
├── import.sh # The one-click import script for Linux. It executes the data migration task configured in conf/local_job.cfg, including starting, migrating, verifying, and retrying the task.
├── logs # The log directory.
└── README.md # The README file. We strongly recommend that you read this file before use.
The import.bat and import.sh files are one-click import scripts. You can run them directly after you modify the local_job.cfg file.
The console.bat and console.sh files are command-line tools that you can use to execute commands step by step.
Run scripts or commands in the ossimport directory, which is the same directory as the *.bat and *.sh files.
Distributed mode
ossimport uses a distributed architecture based on a master-worker model. The following shows the structure:
Master --------- Job --------- Console
    |
    |
   TaskTracker
    |_____________________
    |Task     | Task      | Task
    |         |           |
Worker      Worker      Worker
Parameter | Description |
Master | Splits a job into tasks based on the data size and number of files. You can configure the data size and number of files in the sys.properties file. The following steps describe how a job is split into tasks:
|
Worker |
|
TaskTracker | Abbreviated as Tracker. It distributes tasks and tracks task statuses. |
Console | Interacts with users, accepts commands, and displays results. It supports system administration commands such as deploy, start, and stop, and job management commands such as submit, retry, and clean. |
Job | A data migration task submitted by a user. For a user, one task corresponds to one |
Task | A job can be divided into multiple tasks based on the data size and number of files. Each task migrates a portion of the files. The smallest unit for splitting a job is a single file. A file is not split across multiple tasks. |
In distributed mode, you can start multiple machines for data migration. You can start only one Worker on each machine. Tasks are distributed evenly among the Workers. Each Worker can execute multiple tasks.
The following shows the file structure in distributed mode:
ossimport
├── bin
│ ├── console.jar # The JAR package for the Console module.
│ ├── master.jar # The JAR package for the Master module.
│ ├── tracker.jar # The JAR package for the Tracker module.
│ └── worker.jar # The JAR package for the Worker module.
├── conf
│ ├── job.cfg # The job configuration file template.
│ ├── sys.properties # The configuration file for system parameters.
│ └── workers # The list of workers.
├── console.sh # The command-line tool. Only Linux is currently supported.
├── logs # The log directory.
└── README.md # The README file. We strongly recommend that you read this file before use.
Configuration files
In standalone mode, there are two configuration files: sys.properties and local_job.cfg. In distributed mode, there are three configuration files: sys.properties, job.cfg, and workers. The configuration items in local_job.cfg and job.cfg are identical. They only differ in name. The workers file is specific to the distributed environment.
sys.properties: System runtime parameters
Parameter
Description
Details
workingDir
Working directory
The directory where the tool package is decompressed. Do not change this parameter in standalone mode. In distributed mode, the working directory must be the same on each machine.
workerUser
The SSH username for the worker machine
If `privateKeyFile` is configured, it is used with priority.
If `privateKeyFile` is not configured, `workerUser` and `workerPassword` are used.
Do not change this parameter in standalone mode.
workerPassword
The SSH password for the worker machine
Do not change this parameter in standalone mode.
privateKeyFile
The path of the private key file
If an SSH channel is already established, you can specify this parameter. Otherwise, leave it empty.
If `privateKeyFile` is configured, it is used with priority.
If `privateKeyFile` is not configured, `workerUser` and `workerPassword` are used.
Do not change this parameter in standalone mode.
sshPort
The SSH port
The default value is 22. Do not change this value unless necessary. Do not change this parameter in standalone mode.
workerTaskThreadNum
The maximum number of threads for a worker to execute tasks
This parameter depends on the machine's memory and network. The recommended value is 60.
You can increase this value for a physical machine, for example, to 150. If the network bandwidth is fully utilized, do not increase the value further.
If the network connection is poor, decrease this value to a number such as 30. This helps prevent request timeouts caused by network contention.
workerMaxThroughput(KB/s)
The maximum traffic throughput for data migration on a worker
This value can be used for traffic shaping. The default value is 0, which means no traffic shaping.
dispatcherThreadNum
The number of threads for the tracker to dispatch tasks and check statuses
The default value is usually sufficient. Do not change the default value unless necessary.
workerAbortWhenUncatchedException
Specifies whether to skip or terminate a task when an unknown error occurs
By default, the task is skipped.
workerRecordMd5
Specifies whether to use the x-oss-meta-md5 metadata to record the MD5 hash of migrated files in OSS
By default, the MD5 hash is not recorded. The recorded hash is used for MD5 validation of files.
job.cfg: Data migration task configuration. The configuration items in local_job.cfg and job.cfg are identical. They only differ in name.
Parameter
Description
Details
jobName
The name of the task. String.
The unique identifier of the task. The name must follow the naming convention `[a-zA-Z0-9_-]{4,128}`. You can submit multiple tasks with different names.
If you submit a task with the same name as an existing task, a message appears indicating that the task already exists. You cannot submit a task with the same name until the existing task is cleaned.
jobType
The type of the task. String.
The default value is import.
isIncremental
Specifies whether to enable incremental migration mode. Boolean.
The default value is false.
If set to true, the system rescans for incremental data at the interval specified by `incrementalModeInterval` (in seconds) and migrates the incremental data to OSS.
incrementalModeInterval
The synchronization interval in incremental mode. Integer. Unit: seconds.
This parameter is valid only when `isIncremental` is set to true. The minimum configurable interval is 900 seconds. We recommend that you do not set this value to less than 3600 seconds. This prevents excessive requests and extra system overhead.
importSince
Migrates data modified after this time. Integer. Unit: seconds.
This time is a UNIX timestamp, which is the number of seconds that have elapsed since 00:00:00 UTC on January 1, 1970. Obtain the timestamp by running the `date +%s` command.
The default value is 0, which means all data is migrated.
srcType
The type of the data source. String. This parameter is case-sensitive.
The following types are supported:
local: Migrates data from local files to OSS. For this option, you only need to specify `srcPrefix`. You do not need to specify `srcAccessKey`, `srcSecretKey`, `srcDomain`, or `srcBucket`.
oss: Migrates data from one OSS bucket to another.
qiniu: Migrates data from Qiniu Kodo to OSS.
bos: Migrates data from Baidu BOS to OSS.
ks3: Migrates data from Kingsoft KS3 to OSS.
s3: Migrates data from Amazon S3 to OSS.
youpai: Migrates data from UPYUN USS to OSS.
http: Migrates data to OSS from a provided list of HTTP or HTTPS links.
cos: Migrates data from Tencent Cloud COS to OSS.
azure: Migrates data from Azure Blob to OSS.
srcAccessKey
The AccessKey ID of the source. String.
If `srcType` is set to `oss`, `qiniu`, `bos`, `ks3`, or `s3`, enter the AccessKey ID of the data source.
If `srcType` is set to `local` or `http`, you do not need to specify this parameter.
If `srcType` is set to `youpai` or `azure`, enter the username (AccountName).
srcSecretKey
The AccessKey secret of the source. String.
If `srcType` is set to `oss`, `qiniu`, `bos`, `ks3`, or `s3`, enter the AccessKey secret of the data source.
If `srcType` is set to `local` or `http`, you do not need to specify this parameter.
If `srcType` is set to `youpai`, enter the operator password.
If `srcType` is set to `azure`, enter the AccountKey.
srcDomain
The endpoint of the source
If `srcType` is set to `local` or `http`, you do not need to specify this parameter.
If `srcType` is set to `oss`, enter the endpoint obtained from the console. This is not a second-level domain name that includes a bucket prefix.
If `srcType` is set to `qiniu`, enter the endpoint of the corresponding bucket obtained from the Qiniu console.
If `srcType` is set to `bos`, enter the Baidu BOS endpoint, such as http://bj.bcebos.com or http://gz.bcebos.com.
If `srcType` is set to `ks3`, enter the Kingsoft KS3 endpoint, such as http://kss.ksyun.com, http://ks3-cn-beijing.ksyun.com, or http://ks3-us-west-1.ksyun.com.
If `srcType` is set to `s3`, enter the endpoint for the Amazon S3 region.
If `srcType` is set to `youpai`, enter the UPYUN endpoint. For example, use http://v0.api.upyun.com to automatically determine the optimal line, http://v1.api.upyun.com for the China Telecom or China Netcom line, http://v2.api.upyun.com for the China Unicom or China Netcom line, or http://v3.api.upyun.com for the China Mobile or China Tietong line.
If `srcType` is set to `cos`, enter the region where the Tencent Cloud bucket is located. For example, the South China region is `ap-guangzhou`.
If `srcType` is set to `azure`, enter the `EndpointSuffix` value of the Azure Blob connection string, such as core.chinacloudapi.cn.
srcBucket
The name of the source bucket or container
If `srcType` is set to `local` or `http`, you do not need to specify this parameter.
If `srcType` is set to `azure`, enter the container name.
For other source types, enter the bucket name.
srcPrefix
The prefix of the source objects. String. Default: empty.
If `srcType` is set to `local`, enter the local directory. You must provide the full path, use single forward slashes (/) as separators, and end the path with a single forward slash (/). Only formats such as c:/example/ or /data/example/ are supported.
Important: Paths such as `c:/example//`, `/data//example/`, or `/data/example//` are invalid.
If `srcType` is set to `oss`, `qiniu`, `bos`, `ks3`, `youpai`, or `s3`, enter the prefix of the objects to be synchronized, excluding the bucket name. For example, data/to/oss/.
To synchronize all files, leave `srcPrefix` empty.
destAccessKey
The AccessKey ID of the destination. String.
The AccessKey ID used to access the OSS service. To view it, go to the Alibaba Cloud Management Console.
destSecretKey
The AccessKey secret of the destination. String.
The AccessKey secret used to access OSS. To view it, go to the Alibaba Cloud Management Console.
destDomain
The endpoint of the destination. String.
Obtain it from the Alibaba Cloud Management Console. This is not a second-level domain name that includes a bucket prefix. For a list of endpoints, see Domain Names.
destBucket
The destination bucket. String.
The name of the OSS bucket. It does not need to end with a forward slash (/).
destPrefix
The prefix for destination objects. String. Default: empty.
The prefix for destination objects. By default, this parameter is empty, and objects are placed directly in the destination bucket.
If you want to synchronize data to a specific directory in OSS, end the prefix with a forward slash (/), such as data/in/oss/.
Note that OSS does not support object names that start with a forward slash (/). Therefore, do not configure `destPrefix` to start with a forward slash (/).
A local file with the path srcPrefix+relativePath is migrated to OSS with the path destDomain/destBucket/destPrefix+relativePath.
A cloud file with the path srcDomain/srcBucket/srcPrefix+relativePath is migrated to OSS with the path destDomain/destBucket/destPrefix+relativePath.
srcSignatureVersion
The signature version of the source.
This parameter applies only to CloudBox.
Value: `oss_signature_v4`, which indicates OSS V4 signature.
destSignatureVersion
The signature version of the destination.
This parameter applies only to CloudBox.
Value: `oss_signature_v4`, which indicates OSS V4 signature.
srcRegion
The region of the source.
This parameter applies only to CloudBox.
destRegion
The region of the destination.
This parameter applies only to CloudBox.
srcCloudBoxId
The ID of the source CloudBox.
This parameter applies only to CloudBox.
destCloudBoxId
The ID of the destination CloudBox.
This parameter applies only to CloudBox.
taskObjectCountLimit
The maximum number of files for each task. Integer. Default: 10000.
This parameter affects the degree of parallelism for task execution. A general formula for this value is: (Total number of files) / (Total number of Workers) / (`workerTaskThreadNum`). The maximum value is 50000. If you do not know the total number of files, use the default value.
taskObjectSizeLimit
The maximum data size for each task. Integer. Unit: bytes. Default: 1 GB.
This parameter affects the degree of parallelism for task execution. A general formula for this value is: (Total data size) / (Total number of Workers) / (`workerTaskThreadNum`). If you do not know the total data size, use the default value.
isSkipExistFile
Specifies whether to skip existing files during data migration. Boolean.
If set to `true`, the system determines whether to skip a file based on its size and `LastModifiedTime`. If set to `false`, it always overwrites existing files in OSS. The default value is `false`.
scanThreadCount
The number of threads for concurrent file scanning. Integer.
Default value: 1
Valid values: 1 to 32
This parameter affects the efficiency of file scanning. Do not change it unless you have specific requirements.
maxMultiThreadScanDepth
The maximum depth of directories for concurrent scanning. Integer.
Default value: 1
Valid values: 1 to 16
The default value of 1 means concurrent scanning between top-level directories.
Do not change this value unless you have specific requirements. An excessively large value may cause the task to fail.
appId
The app ID for Tencent Cloud COS. Integer.
This parameter is valid only when `srcType` is set to `cos`.
httpListFilePath
The absolute path of the file that contains a list of HTTP URLs. String.
This parameter is valid only when `srcType` is set to `http`. When the source is a list of HTTP URLs, you must provide the absolute path to the file containing the HTTP URLs, such as `c:/example/http.list`.
The HTTP links in this file must be in two columns, separated by one or more spaces. The columns represent the source URL prefix and the destination relative path in OSS. For example, if the `c:/example/http.list` file contains:
http://xxx.xxx.com/aa/ bb.jpg
http://xxx.xxx.com/cc/ dd.jpg
If you specify `destPrefix` as `ee/`, the files migrated to OSS will be named as follows:
ee/bb.jpg
ee/dd.jpg
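The naming rule for HTTP list entries can be sketched in a few lines of Python. This is only an illustration, not ossimport code: the function name `plan_http_migration` is hypothetical, and the sketch assumes that the file is fetched from the URL prefix joined with the relative path and that the object name is `destPrefix` followed by the relative path.

```python
def plan_http_migration(list_text, dest_prefix=""):
    """Return (source_url, object_name) pairs for the lines of an HTTP list file."""
    plan = []
    for line in list_text.splitlines():
        parts = line.split()  # columns are separated by one or more spaces
        if not parts:
            continue  # assumption: blank lines are ignored
        if len(parts) != 2:
            raise ValueError(f"expected two columns, got: {line!r}")
        url_prefix, relative_path = parts
        # Assumption: the source URL is the prefix joined with the relative path.
        plan.append((url_prefix + relative_path, dest_prefix + relative_path))
    return plan


example = """http://xxx.xxx.com/aa/ bb.jpg
http://xxx.xxx.com/cc/ dd.jpg"""

for source_url, object_name in plan_http_migration(example, "ee/"):
    print(source_url, "->", object_name)
```

Running a check like this on a list file before submitting the job may help catch lines with a missing column.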
workers: Specific to distributed mode. Enter one IP address per line. For example:
192.168.1.6
192.168.1.7
192.168.1.8
In the configuration above, the first line, 192.168.1.6, must be the master. This means that the Master, Worker, and TaskTracker will all start on 192.168.1.6. The Console must also run on this machine.
Ensure that multiple Worker machines have the same username, logon method, and working directory.
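As a small illustration of how the workers file is read, the following Python sketch separates the master from the full Worker list. The helper name `parse_workers` is hypothetical, and skipping blank lines is an assumption; the text above does not say how ossimport itself treats them.

```python
def parse_workers(text):
    """Return (master_ip, worker_ips) from the contents of conf/workers."""
    ips = [line.strip() for line in text.splitlines() if line.strip()]
    if not ips:
        raise ValueError("the workers file lists no machines")
    # The first listed machine hosts the Master, the TaskTracker, and a Worker,
    # so it appears both as the master and in the Worker list.
    return ips[0], ips


master, workers = parse_workers("192.168.1.6\n192.168.1.7\n192.168.1.8\n")
print("master:", master)
print("workers:", workers)
```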
Configuration file examples
The following table provides configuration files for data migration tasks in distributed mode. The configuration file for standalone mode is named local_job.cfg, and its configuration items are the same as those for distributed mode.
Migration type | Configuration file | Description |
Migrate data from a local server to OSS | | `srcPrefix` must be an absolute path that ends with a forward slash (/), such as |
Migrate data from Qiniu Kodo to OSS | | `srcPrefix` and `destPrefix` can be empty. If they are not empty, they must end with a forward slash (/), such as |
Migrate data from Baidu BOS to OSS | | `srcPrefix` and `destPrefix` can be empty. If they are not empty, they must end with a forward slash (/), such as |
Migrate data from Amazon S3 to OSS | | For more information, see S3 endpoints. |
Migrate data from UPYUN USS to OSS | | Set `srcAccessKey` and `srcSecretKey` to the operator account and password. |
Migrate data from Tencent Cloud COS to OSS | | Set `srcDomain` based on the V4 version requirement, for example, |
Migrate data from Azure Blob to OSS | | Set `srcAccessKey` and `srcSecretKey` to the storage account and key. Set `srcDomain` to the `EndpointSuffix` in the connection string, such as |
Migrate data from OSS to OSS | | This is suitable for data migration between different regions, between different storage classes, or between different prefixes. We recommend deploying on an ECS instance and using an internal endpoint to reduce traffic costs. |
Advanced settings
Time-based throttling
In sys.properties, workerMaxThroughput(KB/s) represents the upper limit of a Worker's traffic. If your business requires traffic shaping, such as for source-side throttling or network limitations, set this parameter. The value of this parameter should be less than the machine's maximum network traffic and should be evaluated based on business needs. After changing the value, you must restart the service for it to take effect.
In a distributed deployment, you need to modify the sys.properties file in the $OSS_IMPORT_WORK_DIR/conf directory of each Worker, and then restart the service.
To implement time-based throttling, you can use crontab to schedule modifications to sys.properties and then restart the service for the changes to take effect.
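A script scheduled through crontab needs some way to rewrite the throttle value. The following Python sketch shows one possible approach, assuming that sys.properties consists of plain key=value lines. The helper name `set_property` is hypothetical, and the sketch deliberately stops short of writing the file back or restarting the service, because the restart command is not covered here.

```python
def set_property(text, key, value):
    """Return properties text with `key` set to `value`.

    Replaces the first uncommented `key=...` line, or appends one if absent.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # leave commented-out settings untouched
        name, sep, _ = stripped.partition("=")
        if sep and name.strip() == key:
            lines[i] = f"{key}={value}"
            break
    else:
        # No existing entry was found, so add a new one at the end.
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


sample = "workerTaskThreadNum=60\nworkerMaxThroughput(KB/s)=0\n"
print(set_property(sample, "workerMaxThroughput(KB/s)", 10240), end="")
```

Two crontab entries could then call such a script with different values, for example a lower limit during business hours and 0 (no traffic shaping) at night, each followed by a service restart.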
Change task concurrency
In sys.properties, workerTaskThreadNum represents the number of tasks a Worker can execute concurrently. If the network connection is poor and concurrency is high, many timeout errors may occur. In this case, you can change this parameter to reduce the concurrency level and then restart the service.
In sys.properties, workerMaxThroughput(KB/s) represents the upper limit of a Worker's traffic. If your business requires traffic shaping, such as for source-side throttling or network limitations, set this parameter. The value of this parameter should be less than the machine's maximum network traffic and should be evaluated based on business needs.
In job.cfg, taskObjectCountLimit is the maximum number of files per task, with a default of 10000. This parameter affects the number of tasks. A number that is too small cannot achieve effective concurrency.
In job.cfg, taskObjectSizeLimit is the maximum data size per task, with a default of 1 GB. This parameter affects the number of tasks. A number that is too small cannot achieve effective concurrency.
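The general formula given for these two limits, the total divided by the number of Workers and then by `workerTaskThreadNum`, can be worked through in Python. This is a rough sketch: the function name `suggest_task_limits` and the job figures are hypothetical, and rounding down to a whole number with a floor of 1 is an assumption rather than documented behavior.

```python
MAX_TASK_OBJECT_COUNT = 50000  # documented maximum for taskObjectCountLimit


def suggest_task_limits(total_files, total_bytes, workers, worker_task_thread_num):
    """Apply total / Workers / workerTaskThreadNum to both task limits."""
    slots = workers * worker_task_thread_num
    # Assumption: round down, never below 1, and cap the count at the maximum.
    count_limit = max(1, min(total_files // slots, MAX_TASK_OBJECT_COUNT))
    size_limit = max(1, total_bytes // slots)
    return count_limit, size_limit


# Hypothetical job: 30 million files, 60 TiB, 5 Workers, 60 threads per Worker.
count_limit, size_limit = suggest_task_limits(30_000_000, 60 * 1024**4, 5, 60)
print("taskObjectCountLimit =", count_limit)
print("taskObjectSizeLimit =", size_limit)
```

If the totals are unknown, the default values remain the safer choice.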
Important: Configure all parameters before you start the migration task.
After you change parameters in `sys.properties`, you must restart the migration server for the changes to take effect.
After a `job.cfg` task is submitted, its configuration parameters cannot be changed.
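Because a submitted job cannot be changed, it may be worth checking the values before submission. The following Python sketch checks a few of the constraints stated in the job.cfg parameter descriptions. The function name `check_job` is hypothetical, the values are assumed to be available as a dictionary of strings, and the checks are not exhaustive.

```python
import re

JOB_NAME_RE = re.compile(r"[a-zA-Z0-9_-]{4,128}")
SRC_TYPES = {"local", "oss", "qiniu", "bos", "ks3", "s3",
             "youpai", "http", "cos", "azure"}


def check_job(cfg):
    """Return a list of problems found in a dict of job.cfg values."""
    problems = []
    if not JOB_NAME_RE.fullmatch(cfg.get("jobName", "")):
        problems.append("jobName must match [a-zA-Z0-9_-]{4,128}")
    src_type = cfg.get("srcType", "")
    if src_type not in SRC_TYPES:  # srcType is case-sensitive
        problems.append(f"unsupported srcType: {src_type!r}")
    if src_type == "local":
        src_prefix = cfg.get("srcPrefix", "")
        if not src_prefix.endswith("/") or "//" in src_prefix:
            problems.append("local srcPrefix must use single '/' and end with '/'")
    if cfg.get("destPrefix", "").startswith("/"):
        problems.append("destPrefix must not start with '/'")
    if cfg.get("isIncremental", "false") == "true":
        if int(cfg.get("incrementalModeInterval", "0")) < 900:
            problems.append("incrementalModeInterval must be at least 900 seconds")
    if int(cfg.get("taskObjectCountLimit", "10000")) > 50000:
        problems.append("taskObjectCountLimit must not exceed 50000")
    return problems


job = {"jobName": "local_test", "srcType": "local",
       "srcPrefix": "/data//example/", "destPrefix": "/backup/"}
for problem in check_job(job):
    print(problem)
```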
Verify data without migration
ossimport supports verifying data without migrating it. To do this, set the jobType configuration item in the job.cfg or local_job.cfg file to audit instead of import. Other configurations are the same as for data migration.
Incremental data migration mode
In incremental data migration mode, the task first performs a full migration. Then, it automatically performs incremental data migrations at the specified interval. The initial full migration starts immediately after the task is submitted. Subsequent incremental migrations run once per cycle. This mode is suitable for data backup and data synchronization.
There are two configuration items for incremental mode:
In job.cfg, isIncremental specifies whether to enable incremental migration mode. true enables incremental mode, and false disables it. The default is disabled.
In job.cfg, incrementalModeInterval specifies the synchronization interval in incremental mode, which is the period between incremental data migrations, in seconds. This is valid only when isIncremental=true. The minimum configurable value is 900 seconds. We recommend that you do not set this value to less than 3600 seconds. This prevents excessive requests and extra system overhead.
Specify filter conditions for migration
Filter conditions allow you to migrate only files that meet specific criteria. ossimport supports filtering by prefix and last modified time:
In job.cfg, srcPrefix is used to specify the prefix of the files to migrate. It is empty by default.
If srcType=local, enter the local directory. You must provide the full path, use single forward slashes (/) as separators, and end the path with a single forward slash (/), such as c:/example/ or /data/example/.
If srcType is oss, qiniu, bos, ks3, youpai, or s3, this is the prefix of the objects to be synchronized, excluding the bucket name. For example, data/to/oss/. To migrate all files, leave srcPrefix empty.
In job.cfg, importSince is used to specify the last modified time of the files to migrate, in seconds. importSince is a UNIX timestamp, which is the number of seconds that have elapsed since 00:00:00 UTC on January 1, 1970. Obtain the timestamp by running the date +%s command. The default value is 0, which means all data is migrated. In incremental mode, this parameter applies only to the first full migration. In non-incremental mode, it applies to the entire migration task.
If a file's LastModified Time is before importSince, the file is not migrated.
If a file's LastModified Time is after importSince, the file is migrated.
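The two filters can be illustrated with a short Python sketch that computes an importSince value without the date command and applies both conditions to one object. The function names are hypothetical, and treating a file modified exactly at importSince as not migrated is an assumption, since only the "before" and "after" cases are described.

```python
from datetime import datetime, timezone


def to_import_since(year, month, day):
    """UNIX timestamp in seconds for 00:00:00 UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def is_selected(key, last_modified, src_prefix="", import_since=0):
    """True if an object passes the prefix and last-modified filters."""
    # Assumption: a file modified exactly at importSince is not migrated.
    return key.startswith(src_prefix) and last_modified > import_since


since = to_import_since(2024, 1, 1)
print("importSince =", since)
print(is_selected("data/to/oss/a.txt", since + 60, "data/to/oss/", since))
print(is_selected("data/to/oss/b.txt", since - 60, "data/to/oss/", since))
```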