FTP data source


DataWorks supports FTP as a data source to read data from and write data to FTP servers. This topic describes the capabilities of DataWorks for FTP data synchronization.

Limitations

FTP Reader reads data from remote FTP files and converts it into the Data Integration protocol. Remote FTP files are stored as unstructured data. The following lists describe the features that FTP Reader supports and does not support for data synchronization.

Supported

  • Reads data only from text files. The schema of the data in the text files must be a two-dimensional table.

  • Reads data from CSV-like files with custom delimiters.

  • Reads data of various data types as strings. Column pruning and constant columns are supported.

  • Supports recursive reads and file name filtering.

  • Supports file compression. The following compression formats are supported: gzip, bzip2, zip, lzo, and lzo_deflate.

  • Supports concurrent reads from multiple files.

Unsupported

  • Concurrent read of a single file by using parallel threads. This requires an internal file-splitting algorithm.

  • Concurrent read of a single compressed file by using parallel threads. This is not supported due to technical limitations.

FTP Writer converts data based on the Data Integration protocol and writes the data to files on an FTP server. Remote FTP files are stored as unstructured data. The following lists describe the features that FTP Writer supports and does not support.

Supported

  • Writes data only to text files. BLOB data such as videos is not supported. The schema of the data in the text files must be a two-dimensional table.

  • Writes data to CSV-like and TEXT files with custom delimiters.

  • Writes data by using parallel threads. Each thread writes data to a different sub-file.

Unsupported

  • Concurrent writes to a single file.

  • Native data types. FTP Writer writes all data as the STRING data type.

  • File compression when data is written.

Supported data types

Remote FTP files have no native data types. Instead, the data types are defined by the FTP Reader in Data Integration.

Data Integration type    FTP type
LONG                     LONG
DOUBLE                   DOUBLE
STRING                   STRING
BOOLEAN                  BOOLEAN
DATE                     DATE

Add a data source

Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data source management. You can view parameter descriptions in the DataWorks console to understand the meanings of the parameters when you add a data source.

Develop a data synchronization task

For information about the entry point for and the procedure of configuring a synchronization task, see the following configuration guides.

Single-table batch synchronization

Appendix: Script sample and parameters

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Use the Code Editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Reader script sample

{
    "type":"job",
    "version":"2.0",// Version number.
    "steps":[
        {
            "stepType":"ftp",// Plug-in name.
            "parameter":{
                "path":[],// File path.
                "nullFormat":"",// Null value.
                "compress":"",// Compression format.
                "datasource":"",// Data source.
                "column":[// Columns.
                    {
                        "index":0,// ID.
                        "type":""// Data type.
                    }
                ],
                "skipHeader":"",// Specifies whether to skip the header.
                "fieldDelimiter":",",// Column delimiter.
                "encoding":"UTF-8",// Encoding format.
                "fileFormat":"csv"// File format.
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// Maximum number of allowed dirty data records.
        },
        "speed":{
            "throttle":true,// If false, the mbps parameter is ignored and no throttling is applied. If true, throttling is applied based on the mbps value.
            "concurrent":1, // Job concurrency.
            "mbps":"12"// Throttling rate. 1 mbps = 1 MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
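
The sample above leaves most values empty. As a rough illustration only, a Reader step that reads two gzip-compressed CSV files might be filled in as shown below. The data source name my_ftp_source and both file paths are hypothetical, and the compress value "gzip" is assumed from the list of supported compression formats, so verify the exact accepted string in your environment.

```json
{
    "stepType":"ftp",// Plug-in name.
    "parameter":{
        "datasource":"my_ftp_source",// Hypothetical data source name.
        "path":[// Hypothetical full paths, including file name extensions.
            "/data/orders/part-0001.csv.gz",
            "/data/orders/part-0002.csv.gz"
        ],
        "compress":"gzip",// Assumed value for the gzip format.
        "column":[
            {
                "index":0,// First column of the source file.
                "type":"long"
            },
            {
                "index":1,// Second column of the source file.
                "type":"string"
            }
        ],
        "fieldDelimiter":",",// Column delimiter.
        "encoding":"UTF-8",// Encoding format.
        "fileFormat":"csv"// File format.
    },
    "name":"Reader",
    "category":"reader"
}
```

Because two files are listed, raising concurrent in setting.speed to 2 should allow each file to be read by its own thread, while each compressed file is still read by a single thread. skipHeader is left out because it is not supported for compressed files.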

Reader script parameters

Parameter

Description

Required

Default

datasource

The name of the data source, which must match the name of the data source you configured in DataWorks.

Yes

None

path

The path of the source file on the remote FTP file system. You must specify the full path of the source file, including the file name extension. You can specify multiple paths.

  • If you specify a single remote FTP file, FTP Reader can use only a single thread to extract data. Future versions of FTP Reader will support concurrent reads from a single uncompressed file by using parallel threads.

  • If you specify multiple remote FTP files, FTP Reader can use parallel threads to extract data. The number of concurrent threads is specified by the number of channels.

  • If you specify a wildcard, FTP Reader attempts to traverse and find multiple files. For example, if you specify /*, FTP Reader reads all files in the / directory. If you specify /bazhen/*, FTP Reader reads all files in the /bazhen/ directory. FTP Reader supports only the asterisk (*) as a file wildcard. You can also use scheduling parameters to flexibly configure file names and file paths.

Note
  • Avoid using the asterisk (*) wildcard. This wildcard may cause a Java Virtual Machine (JVM) out-of-memory (OOM) error.

  • Data Integration considers all text files that are synchronized in a job as a single data table. You must ensure that all files can be processed with the same schema.

  • You must ensure that the files are in a CSV-like format and that the Data Integration system has read permissions on them.

  • If no file that matches the specified path is found, the data synchronization task fails.

Yes

None

column

The list of columns to read. type specifies the data type of the source data. index specifies the column from which you want to read data. The value is 0-based. value specifies a constant column. In this case, data is not read from the source file. Instead, the system generates a column based on the value that you specify.

By default, you can read all data as the STRING type by using the "column":["*"] configuration. You can specify the column field as follows.

"column": [
  {
    "type": "long",
    "index": 0    // Reads the first column of the remote FTP text file as a LONG field.
  },
  {
    "type": "string",
    "value": "alibaba"  // Generates a constant string column whose value is alibaba.
  }
]

For each column that you specify, type is required, and you must specify either index or value.

Yes

None

fieldDelimiter

The delimiter that is used to separate columns in the source files.

Note

FTP Reader requires a column delimiter to read data. If you do not specify a delimiter, a comma (,) is used by default. The UI also uses a comma (,) by default.

Yes

,

skipHeader

Specifies whether to skip the header of a CSV-like file. By default, the header is not skipped. This parameter is not supported for compressed files.

No

false

encoding

The encoding of the source files.

No

utf-8

nullFormat

Text files have no standard string that represents a null value (null pointer). Use nullFormat to specify which string is treated as a null value. Examples:

  • If you set this parameter to nullFormat:"null" and the source data is the string "null", Data Integration processes the source data as a null value.

  • If you set this parameter to nullFormat:"\u0001" and the source data is the \u0001 string, Data Integration processes the source data as a null value.

  • If you do not configure the "nullFormat" parameter, no conversion is performed, and the source data is written to the destination as is.

No

None

markDoneFileName

The mark-done file name. Before a data synchronization task starts, the system checks whether the mark-done file exists. If the file does not exist, the system waits for a specific period of time and checks again. The task starts only after the file is detected.

No

None

maxRetryTime

The number of retries for the mark-done file check. The default value is 60. The interval between retries is 1 minute, which results in a total wait time of 60 minutes.

No

60

csvReaderConfig

The parameters that are used to read CSV files. The value must be of the Map type. CSV files are read by CsvReader. Multiple configurations are available. If you do not configure this parameter, default values are used.

No

None

fileFormat

The format of the source files. By default, files are read as CSV files and are processed as logical two-dimensional tables. If you set this parameter to binary, files are copied in binary format.

This parameter is typically used for mirroring directory structures between storage systems such as FTP and OSS. In most cases, you do not need to configure this parameter.

No

None
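
To tie the Reader parameters above together, the following is a minimal sketch of a Reader step for a single uncompressed CSV file with a header. It is not a definitive configuration: the data source name, file paths, and the constant value ftp_import are hypothetical, the boolean form of skipHeader is an assumption, and using ${bizdate} in path relies on the note that scheduling parameters can be used in file names and paths.

```json
{
    "stepType":"ftp",// Plug-in name.
    "parameter":{
        "datasource":"my_ftp_source",// Hypothetical data source name.
        "path":["/data/users/users_${bizdate}.csv"],// Hypothetical path with a scheduling parameter.
        "column":[
            {
                "index":0,// Reads the first column as LONG.
                "type":"long"
            },
            {
                "index":2,// Reads the third column as STRING. The second column is pruned.
                "type":"string"
            },
            {
                "type":"string",
                "value":"ftp_import"// Constant column. Nothing is read from the file.
            }
        ],
        "fieldDelimiter":",",// Column delimiter.
        "skipHeader":true,// Skips the header. Boolean form is assumed.
        "nullFormat":"null",// The string "null" in the source is treated as a null value.
        "encoding":"UTF-8",// Encoding format.
        "markDoneFileName":"/data/users/users_${bizdate}.done",// Hypothetical mark-done file.
        "maxRetryTime":30// Checks for the mark-done file up to 30 times.
    },
    "name":"Reader",
    "category":"reader"
}
```

With a 1-minute interval between checks, maxRetryTime set to 30 would make the task wait up to about 30 minutes for the mark-done file before it starts.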

Writer script sample

{
    "type":"job",
    "version":"2.0",// Version number.
    "steps":[
        { 
            "stepType":"stream",
            "parameter":{},
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"ftp",// Plug-in name.
            "parameter":{
                "path":"",// File path.
                "fileName":"",// File name.
                "nullFormat":"null",// Null value.
                "dateFormat":"yyyy-MM-dd HH:mm:ss",// Date format.
                "datasource":"",// Data source.
                "writeMode":"",// Write mode.
                "fieldDelimiter":",",// Column delimiter.
                "encoding":"",// Encoding format.
                "fileFormat":""// File format.
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// Maximum number of allowed dirty data records.
        },
        "speed":{
            "throttle":true,// If false, the mbps parameter is ignored and no throttling is applied. If true, throttling is applied based on the mbps value.
            "concurrent":1, // Job concurrency.
            "mbps":"12"// Throttling rate. 1 mbps = 1 MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
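
As a hedged illustration of how the empty values in the Writer sample might be filled in, the sketch below writes CSV output with the truncate write mode. The data source name my_ftp_target, the directory /export/orders, and the file name orders are hypothetical, and the lowercase fileFormat value follows the Reader sample and is an assumption.

```json
{
    "stepType":"ftp",// Plug-in name.
    "parameter":{
        "datasource":"my_ftp_target",// Hypothetical data source name.
        "path":"/export/orders",// Hypothetical destination directory.
        "fileName":"orders",// File name prefix. A random suffix is appended for each thread.
        "writeMode":"truncate",// Deletes existing files that have the orders prefix before writing.
        "nullFormat":"null",// Null values are written as the string null.
        "dateFormat":"yyyy-MM-dd HH:mm:ss",// Date format.
        "fieldDelimiter":",",// Column delimiter. Single character only.
        "encoding":"UTF-8",// Encoding format.
        "fileFormat":"csv"// Assumed lowercase form of the CSV format.
    },
    "name":"Writer",
    "category":"writer"
}
```

If concurrent in setting.speed is greater than 1, each thread writes its own sub-file under /export/orders, so downstream consumers should expect several files that share the orders prefix rather than one file.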

Writer script parameters

Parameter

Description

Required

Default

datasource

The name of the data source, which must match the name of the data source you configured in DataWorks.

Yes

None

timeout

The timeout period for connecting to the FTP server. Unit: milliseconds.

No

60,000 (1 minute)

path

The destination path on the FTP file system. FTP Writer writes multiple files to the directory that is specified by this parameter.

Yes

None

fileName

The name of the file to which FTP Writer writes data. A random suffix is appended to this file name to create the actual file name for each thread.

Yes

None

singleFileOutput

By default, FTP Writer appends a random suffix to the specified fileName to create a unique file name for each thread. If you do not want the random suffix to be added, you can set this parameter to true. Then, the output file name is exactly what you specify.

No

false

writeMode

Specifies how existing files in the destination directory are handled before data is written.

  • truncate: If singleFileOutput is true, files that have the same name in the destination directory are deleted before data is written. If singleFileOutput is false, all files that have the specified fileName prefix in the destination directory are deleted before data is written.

  • append: No data is cleared. FTP Writer writes files directly by using the specified fileName and ensures that file names do not conflict.

  • nonConflict: If a file that has the specified fileName prefix exists in the directory, the task fails and an error is reported.

Yes

None

fieldDelimiter

The delimiter that is used to separate columns in the destination files.

Yes (single character only)

None

skipHeader

A CSV-like file may have a header. Specifies whether to skip the header. By default, the header is not skipped. skipHeader is not supported for compressed files.

No

false

compress

The compression format. Two compression formats are supported: gzip and bzip2.

No

No compression

encoding

The encoding of the destination files.

No

utf-8

nullFormat

Text files have no standard string that represents a null value (null pointer). Use nullFormat to specify the string that represents a null value.

For example, if you configure nullFormat="null", when the source data is a null pointer, Data Integration serializes it into the literal string 'null' (4 characters).

No

None

dateFormat

The format for serializing DATE-type data into a file. Example: "dateFormat":"yyyy-MM-dd".

No

None

fileFormat

The format of the destination files. Valid values: CSV and TEXT. CSV is a strict format. If data to be written contains a column delimiter, the data is escaped with double quotation marks (") based on the CSV escape rule. TEXT is a simple format that uses a column delimiter to separate data. Delimiters in the data are not escaped.

No

TEXT

header

The header to write to the text file. In script mode, you can configure the header information. For example, you can set this parameter to "header":["id","name","age"]. Then, id, name, and age are written as the header to the first line of the FTP file.

No

None

markDoneFileName

  • The mark-done file name. After the data synchronization task is complete, the system generates a mark-done file. You can check this file to determine whether the task is successful. You must specify an absolute path for this file.

  • For periodic batch tasks, include a scheduling parameter in the file name. For example, you can set the file name to /user/ftp/markDone_${bizdate}.txt, where ${bizdate} is a scheduling parameter.

No

None
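
The Writer parameters above can be combined as in the following sketch, which writes one CSV file with a header and then generates a mark-done file. Treat it as an example under stated assumptions: the data source name, directory, and file name are hypothetical, the lowercase fileFormat value is assumed, and only the markDoneFileName pattern with ${bizdate} comes from the parameter description.

```json
{
    "stepType":"ftp",// Plug-in name.
    "parameter":{
        "datasource":"my_ftp_target",// Hypothetical data source name.
        "path":"/export/users",// Hypothetical destination directory.
        "fileName":"users.csv",// Hypothetical file name.
        "singleFileOutput":true,// No random suffix. The output file is named exactly users.csv.
        "writeMode":"truncate",// Deletes an existing users.csv before writing.
        "fieldDelimiter":",",// Column delimiter. Single character only.
        "fileFormat":"csv",// Assumed lowercase form. Values that contain the delimiter are quoted.
        "header":["id","name","age"],// Written as the first line of the file.
        "dateFormat":"yyyy-MM-dd",// Format for DATE-type data.
        "markDoneFileName":"/export/users/markDone_${bizdate}.txt"// Absolute path with a scheduling parameter.
    },
    "name":"Writer",
    "category":"writer"
}
```

This sketch is intended for a job whose concurrent setting is 1, because concurrent writes to a single file are listed as unsupported. A downstream task could then wait for the markDone_${bizdate}.txt file before it reads users.csv.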