DataWorks supports FTP as a data source to read data from and write data to FTP servers. This topic describes the capabilities of DataWorks for FTP data synchronization.
Limitations
FTP Reader reads data from remote FTP files and converts it into the Data Integration protocol. Remote FTP files are stored as unstructured data. The following table describes the features that FTP Reader supports for data synchronization.
| Supported | Unsupported |
| --- | --- |
| | |
FTP Writer converts data based on the Data Integration protocol and writes the data to files on an FTP server. Remote FTP files are stored as unstructured data. The following table describes the features that are supported by FTP Writer.
| Supported | Unsupported |
| --- | --- |
| | |
Supported data types
Remote FTP files have no native data types. Instead, the data types are defined by the FTP Reader in Data Integration.
| Data Integration type | FTP type |
| --- | --- |
| LONG | LONG |
| DOUBLE | DOUBLE |
| STRING | STRING |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
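Because the files themselves are untyped text, each field has to be converted according to the type declared for its column. The following Python sketch illustrates that idea only and is not the FTP Reader implementation. The `CONVERTERS` table, the `cast_row` helper, the accepted boolean spellings, and the date pattern are all assumptions made for this example. The `index` and `type` keys follow the column entries used in the reader script sample.

```python
from datetime import datetime

# Hypothetical converters, one per Data Integration type in the table above.
# The boolean spellings and the date pattern are assumptions for this sketch.
CONVERTERS = {
    "long": int,
    "double": float,
    "string": str,
    "boolean": lambda s: {"true": True, "false": False}[s.strip().lower()],
    "date": lambda s: datetime.strptime(s, "%Y-%m-%d"),
}


def cast_row(fields, columns):
    """Convert raw text fields to typed values using index/type column specs."""
    return [CONVERTERS[col["type"].lower()](fields[col["index"]]) for col in columns]


columns = [
    {"index": 0, "type": "long"},
    {"index": 1, "type": "string"},
    {"index": 2, "type": "double"},
    {"index": 3, "type": "boolean"},
    {"index": 4, "type": "date"},
]
row = cast_row("42,alice,3.5,true,2024-01-31".split(","), columns)
print(row)
```

A field that cannot be converted raises an exception in this sketch. In a synchronization task, such records would presumably count as dirty data, but that behavior is not described in this topic.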
Add a data source
Before you develop a synchronization task in DataWorks, add the required data source to DataWorks by following the instructions in Data source management. When you add a data source, the DataWorks console shows a description of each parameter.
Develop a data synchronization task
For information about the entry point for and the procedure of configuring a synchronization task, see the following configuration guides.
Single-table batch synchronization
- For more information about the configuration procedure, see Configure a batch synchronization task using the Codeless UI and Configure a batch synchronization task using script mode.
- For all parameters and a script sample for configuring a batch task in script mode, see Appendix: Script sample and parameters.
Appendix: Script sample and parameters
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Use the Code Editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.
Reader script sample
{
    "type":"job",
    "version":"2.0",// Version number.
    "steps":[
        {
            "stepType":"ftp",// Plug-in name.
            "parameter":{
                "path":[],// File paths.
                "nullFormat":"",// String that represents a null value.
                "compress":"",// Compression format.
                "datasource":"",// Data source name.
                "column":[// Columns to read.
                    {
                        "index":0,// Column index.
                        "type":""// Data type.
                    }
                ],
                "skipHeader":"",// Specifies whether to skip the header.
                "fieldDelimiter":",",// Column delimiter.
                "encoding":"UTF-8",// Encoding format.
                "fileFormat":"csv"// File format.
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// Maximum number of allowed dirty data records.
        },
        "speed":{
            "throttle":true,// If false, the mbps parameter is ignored and no throttling is applied. If true, throttling is applied based on the mbps value.
            "concurrent":1,// Job concurrency.
            "mbps":"12"// Throttling rate. 1 mbps = 1 MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
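The `//` annotations in the sample are explanatory and are not valid in strict JSON, so a script that is generated programmatically would typically omit them. The following Python sketch shows one way to assemble the same structure and serialize it. The `build_ftp_reader_job` helper, the data source name `my_ftp_source`, and the file path are hypothetical values chosen for illustration. Only the keys shown in the sample above are used.

```python
import json


def build_ftp_reader_job(datasource, paths, columns, field_delimiter=","):
    """Assemble a job dict that mirrors the reader script sample above."""
    return {
        "type": "job",
        "version": "2.0",
        "steps": [
            {
                "stepType": "ftp",
                "parameter": {
                    "datasource": datasource,
                    "path": list(paths),
                    "column": list(columns),
                    "fieldDelimiter": field_delimiter,
                    "encoding": "UTF-8",
                    "fileFormat": "csv",
                },
                "name": "Reader",
                "category": "reader",
            },
            {"stepType": "stream", "parameter": {}, "name": "Writer", "category": "writer"},
        ],
        "setting": {
            "errorLimit": {"record": "0"},
            "speed": {"throttle": True, "concurrent": 1, "mbps": "12"},
        },
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    }


job = build_ftp_reader_job(
    datasource="my_ftp_source",        # hypothetical data source name
    paths=["/data/in/orders.csv"],     # hypothetical source file path
    columns=[{"index": 0, "type": "long"}, {"index": 1, "type": "string"}],
)
script = json.dumps(job, indent=4)
print(script)
```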
Reader script parameters
| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| datasource | The name of the data source. It must match the name of the data source that you added to DataWorks. | Yes | None |
| path | The path of the source file on the remote FTP file system. Specify the full path of the source file, including the file name extension. You can specify multiple paths. | Yes | None |
| column | The list of columns to read. By default, all data can be read as the STRING type. For each column that you specify, set type and index, as shown in the reader script sample. | Yes | None |
| fieldDelimiter | The delimiter that separates columns in the source files. Note: FTP Reader needs a delimiter to read data. If you do not specify one, the default comma (,) is used. The UI also uses a comma (,) by default. | Yes | , |
| skipHeader | Specifies whether to skip the header of a CSV-like file. By default, the header is not skipped. This parameter is not supported for compressed files. | No | false |
| encoding | The encoding of the source files. | No | utf-8 |
| nullFormat | Text files have no standard string that represents a null value (null pointer). Use nullFormat to define the string that is treated as null. | No | None |
| markDoneFileName | The name of the mark-done file. Before a data synchronization task starts, the system checks whether the mark-done file exists. If the file does not exist, the system waits for a period of time and checks again. The task starts only after the file is detected. | No | None |
| maxRetryTime | The number of retries for the mark-done file check. The interval between retries is 1 minute, so the default value of 60 results in a total wait time of 60 minutes. | No | 60 |
| csvReaderConfig | The parameters that are used to read CSV files. The value must be of the Map type. CSV files are read by CsvReader, which supports multiple configuration items. If you do not configure this parameter, default values are used. | No | None |
| fileFormat | The format of the source files. By default, files are read as CSV files and are processed as logical two-dimensional tables. Changing this parameter is typically needed only for mirroring directory structures between storage systems such as FTP and OSS. In most cases, you do not need to configure this parameter. | No | None |
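The interaction between markDoneFileName and maxRetryTime can be pictured as a simple polling loop. The following Python sketch is an illustration under stated assumptions, not the Data Integration implementation. `wait_for_mark_done` and its `file_exists` callback are hypothetical names, and returning `False` when the retries are exhausted is an assumption, because this topic does not say what happens in that case.

```python
import time


def wait_for_mark_done(file_exists, max_retry_time=60, interval_seconds=60,
                       sleep=time.sleep):
    """Poll for the mark-done file; return True once it is found.

    file_exists is a hypothetical callable that checks the FTP server.
    Returns False if the file is still missing after max_retry_time checks.
    """
    for _ in range(max_retry_time):
        if file_exists():
            return True
        sleep(interval_seconds)  # 1-minute interval between checks
    return False


# Simulated run: the mark-done file appears on the third check.
checks = iter([False, False, True])
waited = []  # records each sleep instead of actually sleeping
found = wait_for_mark_done(lambda: next(checks), sleep=waited.append)
print(found, sum(waited))
```

With the defaults, a file that never appears leads to 60 waits of 60 seconds each, which matches the 60-minute total wait time stated for maxRetryTime.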
Writer script sample
{
    "type":"job",
    "version":"2.0",// Version number.
    "steps":[
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"ftp",// Plug-in name.
            "parameter":{
                "path":"",// Destination directory.
                "fileName":"",// File name.
                "nullFormat":"null",// String that represents a null value.
                "dateFormat":"yyyy-MM-dd HH:mm:ss",// Date format.
                "datasource":"",// Data source name.
                "writeMode":"",// Write mode.
                "fieldDelimiter":",",// Column delimiter.
                "encoding":"",// Encoding format.
                "fileFormat":""// File format.
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// Maximum number of allowed dirty data records.
        },
        "speed":{
            "throttle":true,// If false, the mbps parameter is ignored and no throttling is applied. If true, throttling is applied based on the mbps value.
            "concurrent":1,// Job concurrency.
            "mbps":"12"// Throttling rate. 1 mbps = 1 MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
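The nullFormat and dateFormat values in the sample above affect how individual fields appear in the output file. The following Python sketch shows the likely effect on one record, assuming that null fields are written as the nullFormat string and that date values are formatted with the dateFormat pattern. `serialize_row` is a hypothetical helper, and `PY_DATE_FORMAT` is a manual Python translation of the sample's `yyyy-MM-dd HH:mm:ss` pattern.

```python
from datetime import datetime

# Manual Python translation of the sample's "yyyy-MM-dd HH:mm:ss" pattern.
PY_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def serialize_row(row, null_format="null", date_format=PY_DATE_FORMAT, delimiter=","):
    """Render a row as one text line, applying nullFormat and dateFormat."""
    rendered = []
    for value in row:
        if value is None:
            rendered.append(null_format)                   # null -> nullFormat string
        elif isinstance(value, datetime):
            rendered.append(value.strftime(date_format))   # date -> dateFormat pattern
        else:
            rendered.append(str(value))
    return delimiter.join(rendered)


line = serialize_row([7, None, datetime(2024, 1, 2, 3, 4, 5)])
print(line)
```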
Writer script parameters
| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| datasource | The name of the data source. It must match the name of the data source that you added to DataWorks. | Yes | None |
| timeout | The timeout period for connecting to the FTP server. Unit: milliseconds. | No | 60,000 (1 minute) |
| path | The destination path on the FTP file system. FTP Writer writes multiple files to the directory that is specified by this parameter. | Yes | None |
| fileName | The name of the file to which FTP Writer writes data. A random suffix is appended to this name to form the actual file name for each thread. | Yes | None |
| singleFileOutput | By default, FTP Writer appends a random suffix to the specified fileName. | No | false |
| writeMode | The mode for clearing data before writing. | Yes | None |
| fieldDelimiter | The delimiter that is used to separate columns in the destination files. | Yes (single character only) | None |
| skipHeader | A CSV-like file may have a header row that needs to be skipped. By default, the header is not skipped. skipHeader is not supported for compressed files. | No | false |
| compress | The compression format. gzip and bzip2 are supported. | No | No compression |
| encoding | The encoding of the destination files. | No | utf-8 |
| nullFormat | Text files have no standard string that represents a null value (null pointer). Use nullFormat to define the string that is treated as null. | No | None |
| dateFormat | The format for serializing DATE-type data into a file. Example: "dateFormat":"yyyy-MM-dd". | No | None |
| fileFormat | The format of the destination files. Valid values: CSV and TEXT. CSV is a strict format. If the data to be written contains a column delimiter, the data is escaped with double quotation marks (") based on the CSV escape rule. TEXT is a simple format that uses a column delimiter to separate data. Delimiters in the data are not escaped. | No | TEXT |
| header | The header to write to the text file. In script mode, you can configure the header, for example "header":["id","name","age"]. | No | None |
| markDoneFileName | | No | None |
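The difference between the CSV and TEXT values of fileFormat is easiest to see on a record whose data contains the column delimiter. The following Python sketch uses the standard `csv` module to approximate the CSV escape rule described above. It is an illustration rather than the FTP Writer implementation, and `format_record` is a hypothetical helper.

```python
import csv
import io


def format_record(fields, file_format="text", delimiter=","):
    """Render one record as a line in the CSV or TEXT style described above."""
    if file_format.lower() == "csv":
        buffer = io.StringIO()
        writer = csv.writer(buffer, delimiter=delimiter, quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
        writer.writerow(fields)
        # Drop the line terminator that csv.writer appends.
        return buffer.getvalue().rstrip("\r\n")
    # TEXT: join with the delimiter and leave embedded delimiters untouched.
    return delimiter.join(fields)


record = ["1", "Smith, John", "NY"]
print(format_record(record, "csv"))
print(format_record(record, "text"))
```

In the TEXT line, the embedded comma is indistinguishable from a column delimiter, so a downstream reader would see four columns instead of three. CSV is therefore the safer choice when field values may contain the delimiter.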