FTP data synchronization capabilities

Limitations

FTP Reader reads data from remote FTP files and converts it into the Data Integration protocol. Remote FTP files are stored as unstructured data. The following table describes the features that FTP Reader supports for data synchronization.

Supported	Unsupported
Reads data only from text files. The schema of the data in the text files must be a two-dimensional table.	Concurrent read of a single file by using parallel threads. This requires an internal file-splitting algorithm.
Reads data from CSV-like files with custom delimiters.	Concurrent read of a single compressed file by using parallel threads. This is not supported due to technical limitations.
Reads data of various data types as strings. Column pruning and constant columns are supported.	
Supports recursive reads and file name filtering.	
Supports file compression. The following compression formats are supported: gzip, bzip2, zip, lzo, and lzo_deflate.	
Supports concurrent reads from multiple files.	

FTP Writer converts data based on the Data Integration protocol and writes the data to files on an FTP server. Remote FTP files are stored as unstructured data. The following table describes the features that are supported by FTP Writer.

Supported	Unsupported
Writes data only to text files. BLOB data such as videos is not supported. The schema of the data in the text files must be a two-dimensional table.	Concurrent writes to a single file.
Writes data to CSV-like and TEXT files with custom delimiters.	Native data types. FTP Writer writes all data as the STRING data type.
Writes data by using parallel threads. Each thread writes data to a different sub-file.	File compression when data is written.

Supported data types

Remote FTP files have no native data types. Instead, the data types are defined by the FTP Reader in Data Integration.

Data Integration type	FTP type
LONG	LONG
DOUBLE	DOUBLE
STRING	STRING
BOOLEAN	BOOLEAN
DATE	DATE

Add a data source

Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data source management. When you add a data source, the DataWorks console shows a description of each parameter.

Develop a data synchronization task

For the entry point and the procedure for configuring a synchronization task, see the following configuration guides.

Single-table batch synchronization

For more information about the configuration procedure, see Configure a batch synchronization task using the Codeless UI and Configure a batch synchronization task using script mode.
For all parameters and a script sample for configuring a batch task in script mode, see Appendix: Script sample and parameters.

Appendix: Script sample and parameters

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Use the Code Editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Reader script sample

{
    "type":"job",
    "version":"2.0",// Version number.
    "steps":[
        {
            "stepType":"ftp",// Plug-in name.
            "parameter":{
                "path":[],// File path.
                "nullFormat":"",// Null value.
                "compress":"",// Compression format.
                "datasource":"",// Data source.
                "column":[// Columns.
                    {
                        "index":0,// Column index (0-based).
                        "type":""// Data type.
                    }
                ],
                "skipHeader":"",// Specifies whether to skip the header.
                "fieldDelimiter":",",// Column delimiter.
                "encoding":"UTF-8",// Encoding format.
                "fileFormat":"csv"// File format.
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// Maximum number of allowed dirty data records.
        },
        "speed":{
            "throttle":true,// If false, the mbps parameter is ignored and no throttling is applied. If true, throttling is applied based on the mbps value.
            "concurrent":1, // Job concurrency.
            "mbps":"12"// Throttling rate. 1 mbps = 1 MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
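
The sample above leaves most values empty. As a minimal sketch, the reader step could be assembled in Python and serialized with `json.dumps`. The helper name `build_ftp_reader_step`, the data source name `my_ftp_source`, and the file path are hypothetical placeholders for illustration, not values defined by DataWorks.

```python
import json


def build_ftp_reader_step(datasource, paths, columns,
                          field_delimiter=",", encoding="UTF-8"):
    """Return a dict shaped like the FTP Reader step in the sample above."""
    return {
        "stepType": "ftp",
        "parameter": {
            "path": list(paths),
            "datasource": datasource,
            "column": list(columns),
            "fieldDelimiter": field_delimiter,
            "encoding": encoding,
            "fileFormat": "csv",
        },
        "name": "Reader",
        "category": "reader",
    }


# Hypothetical values for illustration only.
reader_step = build_ftp_reader_step(
    datasource="my_ftp_source",
    paths=["/home/ftp/input/orders.csv"],
    columns=[
        {"index": 0, "type": "long"},
        {"index": 1, "type": "string"},
    ],
)
print(json.dumps(reader_step, indent=4))
```

Building the step from a function keeps the key names in one place, so several jobs that differ only in path or columns can share the same template.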

Reader script parameters

Parameter	Description	Required	Default
datasource	The name of the data source, which must match the name of the data source you configured in DataWorks.	Yes	None
path	The path of the source file on the remote FTP file system. Specify the full path, including the file name extension. You can specify multiple paths. If you specify a single remote FTP file, FTP Reader extracts data with a single thread. Future versions of FTP Reader will support concurrent reads from a single uncompressed file by using parallel threads. If you specify multiple remote FTP files, FTP Reader extracts data with parallel threads, and the number of concurrent threads is determined by the number of channels. If you specify a wildcard, FTP Reader traverses the path to find multiple files. For example, if you specify /*, FTP Reader reads all files in the / directory. If you specify /bazhen/*, FTP Reader reads all files in the /bazhen/ directory. FTP Reader supports only the asterisk (*) as a file wildcard. You can also use scheduling parameters to configure file names and file paths flexibly. Note: Avoid using the asterisk (*) wildcard, because it may cause a Java Virtual Machine (JVM) out-of-memory (OOM) error. Data Integration treats all text files synchronized in a job as a single data table, so all files must share the same schema. The files must be in a CSV-like format, and Data Integration must have read permissions on them. If no file matches the specified path, the data synchronization task fails.	Yes	None
column	The list of columns to read. `type` specifies the data type of the source data. `index` specifies the 0-based position of the column to read. `value` specifies a constant column: no data is read from the source file, and the system generates the column from the value that you specify. By default, you can read all data as the STRING type by using the `"column":["*"]` configuration. You can also specify columns explicitly. For example, `{ "type": "long", "index": 0 }` reads the first column of the remote FTP text file as a LONG field, and `{ "type": "string", "value": "alibaba" }` generates a string field whose value is alibaba in FTP Reader. For each column, you must specify `type` and exactly one of `index` or `value`.	Yes	None
fieldDelimiter	The delimiter that separates columns in the source files. Note: FTP Reader requires a delimiter to read data. If you do not specify one, the default comma (,) is used. The UI also uses a comma (,) by default.	Yes	,
skipHeader	Specifies whether to skip the header row of a CSV-like file. By default, the header is not skipped. This parameter is not supported for compressed files.	No	false
encoding	The encoding of the source files.	No	utf-8
nullFormat	Text files have no standard way to represent null values (null pointers). Use `nullFormat` to specify which string represents a null value. Examples: If you set `nullFormat:"null"` and a source field is the string "null", Data Integration treats the field as a null value. If you set `nullFormat:"\u0001"` and a source field is the string `\u0001`, Data Integration treats the field as a null value. If you do not configure `nullFormat`, no conversion is performed, and the source data is written to the destination as is.	No	None
markDoneFileName	The mark-done file name. Before a data synchronization task starts, the system checks whether the mark-done file exists. If the file does not exist, the system waits for a specific period of time and checks again. The task starts only after the file is detected.	No	None
maxRetryTime	The number of retries for the mark-done file check. The default value is 60. The interval between retries is 1 minute, which results in a total wait time of 60 minutes.	No	60
csvReaderConfig	The parameters that are used to read CSV files. The value must be of the Map type. CSV files are read by CsvReader. Multiple configurations are available. If you do not configure this parameter, default values are used.	No	None
fileFormat	The format of the source files. By default, files are read as CSV files and are processed as logical two-dimensional tables. If you set this parameter to `binary`, files are copied in binary format. This parameter is typically used for mirroring directory structures between storage systems such as FTP and OSS. In most cases, you do not need to configure this parameter.	No	None
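
The `column` and `nullFormat` rules in the table can be illustrated with a short Python sketch. This only mirrors the documented semantics on a single line of text; it is not FTP Reader's implementation, it does not handle quoted CSV fields, and it does not convert values to their declared types. The function name `apply_columns` is hypothetical.

```python
def apply_columns(line, columns, field_delimiter=",", null_format=None):
    """Project one text line onto the configured columns."""
    fields = line.split(field_delimiter)
    record = []
    for col in columns:
        if "type" not in col:
            raise ValueError("each column needs a type")
        # Exactly one of index or value must be present.
        if ("index" in col) == ("value" in col):
            raise ValueError("specify exactly one of index or value")
        if "value" in col:
            # Constant column: nothing is read from the file.
            record.append(col["value"])
            continue
        raw = fields[col["index"]]  # index is 0-based
        # A field equal to nullFormat is treated as a null value.
        is_null = null_format is not None and raw == null_format
        record.append(None if is_null else raw)
    return record


columns = [
    {"type": "long", "index": 0},             # first column
    {"type": "string", "index": 2},           # third column; the second is pruned
    {"type": "string", "value": "alibaba"},   # constant column
]
print(apply_columns("42,skipped,null", columns, null_format="null"))
# prints ['42', None, 'alibaba']
```

Without `null_format`, the third field would come through as the literal string `null`, which matches the "no conversion is performed" behavior described for an unset `nullFormat`.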

Writer script sample

{
    "type":"job",
    "version":"2.0",// Version number.
    "steps":[
        { 
            "stepType":"stream",
            "parameter":{},
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"ftp",// Plug-in name.
            "parameter":{
                "path":"",// File path.
                "fileName":"",// File name.
                "nullFormat":"null",// Null value.
                "dateFormat":"yyyy-MM-dd HH:mm:ss",// Date format.
                "datasource":"",// Data source.
                "writeMode":"",// Write mode.
                "fieldDelimiter":",",// Column delimiter.
                "encoding":"",// Encoding format.
                "fileFormat":""// File format.
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// Maximum number of allowed dirty data records.
        },
        "speed":{
            "throttle":true,// If false, the mbps parameter is ignored and no throttling is applied. If true, throttling is applied based on the mbps value.
            "concurrent":1, // Job concurrency.
            "mbps":"12"// Throttling rate. 1 mbps = 1 MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}

Writer script parameters

Parameter	Description	Required	Default
datasource	The name of the data source, which must match the name of the data source you configured in DataWorks.	Yes	None
timeout	The timeout period for connecting to the FTP server. Unit: milliseconds.	No	60,000 (1 minute)
path	The destination path on the FTP file system. FTP Writer writes multiple files to the directory that is specified by this parameter.	Yes	None
fileName	The name of the file to which FTP Writer writes data. A random suffix is appended to this file name to create the actual file name for each thread.	Yes	None
singleFileOutput	By default, FTP Writer appends a random suffix to the specified `fileName` to create a unique file name for each thread. If you do not want the random suffix to be added, you can set this parameter to `true`. Then, the output file name is exactly what you specify.	No	false
writeMode	The mode for clearing data before writing. truncate: If `singleFileOutput` is `true`, files that have the same name in the destination directory are deleted before data is written. If `singleFileOutput` is `false`, all files that have the specified `fileName` prefix in the destination directory are deleted before data is written. append: No data is cleared. FTP Writer writes files directly by using the specified `fileName` and ensures that file names do not conflict. nonConflict: If a file that has the specified `fileName` prefix exists in the directory, the task fails and an error is reported.	Yes	None
fieldDelimiter	The delimiter that is used to separate columns in the destination files.	Yes (single character only)	None
skipHeader	A CSV-like file may have a header row that needs to be skipped. By default, the header is not skipped. skipHeader is not supported for compressed files.	No	false
compress	The compression format. Two formats are supported: gzip and bzip2.	No	No compression
encoding	The encoding of the destination files.	No	utf-8
nullFormat	Text files have no standard way to represent null values (null pointers). Use `nullFormat` to specify the string that represents a null value. For example, if you configure `nullFormat="null"` and the source data is a null pointer, Data Integration serializes it as the literal string 'null' (4 characters).	No	None
dateFormat	The format for serializing DATE-type data into a file. Example: "dateFormat":"yyyy-MM-dd".	No	None
fileFormat	The format of the destination files. Valid values: CSV and TEXT. CSV is a strict format. If data to be written contains a column delimiter, the data is escaped with double quotation marks (") based on the CSV escape rule. TEXT is a simple format that uses a column delimiter to separate data. Delimiters in the data are not escaped.	No	TEXT
header	The header to write to the text file. In script mode, you can configure the header information. For example, you can set this parameter to "header":["id","name","age"]. Then, `id`, `name`, and `age` are written as the header to the first line of the FTP file.	No	None
markDoneFileName	The mark-done file name. After the data synchronization task is complete, the system generates a mark-done file. You can check this file to determine whether the task is successful. You must specify an absolute path for this file. For periodic batch tasks, include a scheduling parameter in the file name. For example, you can set the file name to `/user/ftp/markDone_${bizdate}.txt`, where `${bizdate}` is a scheduling parameter.	No	None
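
Two of the writer behaviors in the table, the `writeMode` cleanup rules and the difference between the CSV and TEXT formats, can be sketched in Python as follows. This is an illustration of the documented rules under stated assumptions, not FTP Writer's implementation. The function names and file names are hypothetical, the format of the random suffix is not specified in this document, and writing an empty string for a null when `nullFormat` is unset is an assumption.

```python
import csv
import io


def files_to_delete(existing, file_name, write_mode, single_file_output=False):
    """Return the existing files that would be removed before writing."""
    prefixed = [f for f in existing if f.startswith(file_name)]
    if write_mode == "truncate":
        if single_file_output:
            # Only files with exactly the same name are deleted.
            return [f for f in existing if f == file_name]
        return prefixed
    if write_mode == "append":
        return []
    if write_mode == "nonConflict":
        if prefixed:
            raise FileExistsError(f"files already exist: {prefixed}")
        return []
    raise ValueError(f"unknown writeMode: {write_mode}")


def serialize_record(record, file_format="TEXT", field_delimiter=",",
                     null_format=None):
    """Serialize one record as a line of text, without a line terminator."""
    # Assumption: an unset nullFormat writes nulls as empty strings.
    null_text = "" if null_format is None else null_format
    values = [null_text if v is None else str(v) for v in record]
    if file_format.upper() == "CSV":
        buf = io.StringIO()
        writer = csv.writer(buf, delimiter=field_delimiter, lineterminator="\n")
        writer.writerow(values)  # quotes fields that contain the delimiter
        return buf.getvalue()[:-1]
    # TEXT: join the values as they are, with no escaping.
    return field_delimiter.join(values)


existing = ["orders_ab12", "orders_cd34", "other.txt"]  # hypothetical names
print(files_to_delete(existing, "orders", "truncate"))
# prints ['orders_ab12', 'orders_cd34']

record = [1, "a,b", None]
print(serialize_record(record, "CSV", ",", "null"))   # prints 1,"a,b",null
print(serialize_record(record, "TEXT", ",", "null"))  # prints 1,a,b,null
```

The TEXT output shows why CSV is the safer choice when values can contain the delimiter: the unescaped `a,b` is indistinguishable from two separate columns.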