Read CSV File

The Read CSV File component loads CSV files from Object Storage Service (OSS), HTTP servers, and Hadoop Distributed File System (HDFS) into a Machine Learning Platform for AI (PAI) pipeline.

Limitations

In the PAI console, the component runs only on MaxCompute, Realtime Compute for Apache Flink, and Deep Learning Containers (DLC) computing resources.
In the PyAlink Script component, the component is not configured through a form; you call it in code. For details, see PyAlink Script.

Prerequisites

Before you begin, ensure that you have:

(Optional) Authorized PAI to access OSS. Required only when you set fileSource to OSS. For details, see Grant the permissions that are required to use Machine Learning Designer.
Configured the Default Resource Preferred by Alink or FlinkML parameter on the Pipeline Attributes tab. The system uses this resource type automatically when running components.

Configure the Read CSV File component

Choose one of the following methods based on your workflow.

Method 1: PAI console

Configure the component on the Visualized Modeling (Designer) page.

Parameter setting tab

Parameter	Description	Default
fileSource	Source of the CSV file. Valid values: OSS and OTHERS. Select OTHERS to use an HTTP server or HDFS path.	—
ossFilePath or filePath	Path of the CSV file. If fileSource is OSS, enter or select an OSS path. For files smaller than 1 GB, upload the CSV directly from the Select OSS directory or file page. If fileSource is OTHERS, enter an HTTP or HDFS path.	—
Schema	Data type for each column. Format: `colname0 coltype0, colname1 coltype1, ...`. Example: `f0 string,f1 bigint,f2 double`. The data types must match the actual CSV column types. Do not use periods (.) in field names.	—
fieldDelimiter	Field delimiter.	`,`
quoteString	Quote character.	`"`
rowDelimiter	Row delimiter.	`\n`
ignoreFirstLine	Skip the first row. Turn on if the first row contains column headers.	Off
skipBlankLine	Skip blank rows.	—
handleInvalidMethod	Action taken when a tensor, vector, or MTable value fails to parse. These types are defined by the Alink algorithm framework and have a fixed parsing format. ERROR: stops reading. SKIP: skips the invalid value.	ERROR
lenient	Action taken when a row's data types or column count do not match the Schema. On: discard the row. Off: stop reading and display the error row.	Off

Note: handleInvalidMethod and lenient control different failure modes. handleInvalidMethod applies to Alink-specific type parsing failures (tensor, vector, MTable). lenient applies to schema mismatches (wrong data type or column count). Configure both parameters to define a complete error-handling strategy.
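The effect of the lenient switch can be illustrated without Alink. The following is a minimal sketch that uses only Python's standard csv module. It is not how the component is implemented, and the names read_rows, SchemaMismatch, CONVERTERS, and the sample data are hypothetical. It covers only schema mismatches (lenient), not Alink-specific type parsing (handleInvalidMethod).

```python
import csv
import io

# Hypothetical mapping from schema type names to Python converters.
CONVERTERS = {"string": str, "bigint": int, "double": float}


class SchemaMismatch(ValueError):
    """Raised in strict mode when a row does not fit the schema."""


def read_rows(text, schema, lenient=False):
    """Parse CSV text against a schema string such as 'f0 string,f1 bigint'.

    lenient=True  -> rows with a wrong column count or unparseable value are dropped.
    lenient=False -> the first such row raises SchemaMismatch.
    """
    # Split each 'name type' pair and keep only the type of each column.
    types = [col.split()[1].lower() for col in schema.split(",")]
    rows = []
    for raw in csv.reader(io.StringIO(text)):
        if not raw:
            continue  # blank line; this sketch always skips them
        try:
            if len(raw) != len(types):
                raise ValueError("column count mismatch")
            rows.append([CONVERTERS[t](v) for t, v in zip(types, raw)])
        except ValueError as exc:
            if lenient:
                continue  # discard the offending row and keep reading
            raise SchemaMismatch(f"bad row {raw!r}: {exc}") from exc
    return rows


# Row 2 has a non-numeric bigint value; row 3 has too few columns.
SAMPLE = "a,1,0.5\nb,oops,1.5\nc,3\nd,4,2.5\n"
SCHEMA = "f0 string,f1 bigint,f2 double"

good = read_rows(SAMPLE, SCHEMA, lenient=True)
```

With lenient=True the two malformed rows are silently dropped, which is convenient for dirty data but can hide upstream problems. With lenient=False the same input raises on the first bad row, which mirrors the "stop reading and display the error row" behavior described above.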

Execution tuning tab

Parameter	Description	Default
Number of Workers	Number of compute nodes. Positive integer. Range: 1–9999. Must be set together with Memory per worker.	—
Memory per worker	Memory per node, in MB. Positive integer. Range: 1024–65536.	—

Method 2: PyAlink Script component

Use CsvSourceBatchOp to load a CSV file in a PyAlink Script component.

The following example initializes the reader with CsvSourceBatchOp and chains .set*() methods to configure each parameter.

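# Import the PyAlink API; skip this line if the script environment already provides it.
from pyalink.alink import *
# Public iris sample file and a schema that names and types its five columns.
filePath = 'https://alink-test-data.oss-cn-hangzhou.aliyuncs.com/iris.csv'
schema = 'sepal_length double, sepal_width double, petal_length double, petal_width double, category string'
# Configure the reader by chaining setters.
csvSource = CsvSourceBatchOp()\
    .setFilePath(filePath)\
    .setSchemaStr(schema)\
    .setFieldDelimiter(",")
# Collect the result into a pandas DataFrame.
BatchOperator.collectToDataframe(csvSource)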
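
When the same reader settings are reused across scripts, it can help to keep them in a dictionary and apply them in a loop instead of writing each setter by hand. The sketch below is an illustration only: apply_options, setter_name, and RecordingOp are hypothetical helpers, and it assumes that each parameter has a setter named set plus the capitalized parameter name, as in the chained example above. RecordingOp is a stand-in so the snippet runs without PyAlink installed.

```python
def setter_name(param):
    """Map a parameter name such as 'ignoreFirstLine' to 'setIgnoreFirstLine'."""
    return "set" + param[0].upper() + param[1:]


def apply_options(op, options):
    """Call op.set<Param>(value) for every entry and return the configured operator."""
    for param, value in options.items():
        # Each setter is assumed to return the operator, as in the chained example.
        op = getattr(op, setter_name(param))(value)
    return op


class RecordingOp:
    """Stand-in for a reader operator: records every set*() call it receives."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only pretend to have setters; anything else is a genuine missing attribute.
        if not name.startswith("set"):
            raise AttributeError(name)

        def setter(value):
            self.calls.append((name, value))
            return self

        return setter


options = {
    "filePath": "https://alink-test-data.oss-cn-hangzhou.aliyuncs.com/iris.csv",
    "schemaStr": "sepal_length double, sepal_width double, petal_length double, "
                 "petal_width double, category string",
    "ignoreFirstLine": False,
    "lenient": True,
}

# In a PyAlink Script component this would presumably be
# apply_options(CsvSourceBatchOp(), options); that usage is an assumption.
recorded = apply_options(RecordingOp(), options).calls
```

Because getattr fails loudly on a setter that does not exist, a misspelled parameter name surfaces as an AttributeError when run against a real operator rather than being ignored.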

The following table lists all available parameters.

Parameter	Required	Default	Description
`schemaStr`	Yes	—	Data type for each column. Format: `colname0 coltype0[, colname1 coltype1[, ...]]`. Example: `f0 string,f1 bigint,f2 double`.
`filePath`	No	—	Path of the CSV file.
`fieldDelimiter`	No	`,`	Field delimiter.
`quoteString`	No	`"`	Quote character.
`rowDelimiter`	No	`\n`	Row delimiter.
`ignoreFirstLine`	No	`False`	Set to `True` if the first row contains column headers.
`skipBlankLine`	No	`True`	Skip blank rows.
`handleInvalidMethod`	No	`ERROR`	Action for unparseable tensor, vector, or MTable values. `ERROR`: stop reading. `SKIP`: skip the value.
`lenient`	No	`False`	`True`: discard rows that fail schema validation. `False`: return an error.
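
Schema strings are easy to mistype when a file has many columns. The following sketch builds and parses strings in the `colname coltype` format described for `schemaStr`. The helper names build_schema_str and parse_schema_str are hypothetical, and the check for periods assumes that the field-name restriction stated for the console Schema parameter applies here as well.

```python
def build_schema_str(columns):
    """Turn [(name, type), ...] into 'name type, name type, ...'."""
    parts = []
    for name, col_type in columns:
        # Assumed restriction: field names must not contain periods.
        if "." in name:
            raise ValueError(f"field name {name!r} must not contain a period")
        parts.append(f"{name} {col_type}")
    return ", ".join(parts)


def parse_schema_str(schema):
    """Turn 'name type, name type, ...' back into [(name, type), ...].

    This simple split does not handle type names that themselves contain
    commas or spaces.
    """
    pairs = []
    for item in schema.split(","):
        name, col_type = item.split()  # raises ValueError on malformed items
        pairs.append((name, col_type))
    return pairs


iris_columns = [
    ("sepal_length", "double"),
    ("sepal_width", "double"),
    ("petal_length", "double"),
    ("petal_width", "double"),
    ("category", "string"),
]
iris_schema = build_schema_str(iris_columns)
```

The resulting iris_schema string matches the schema used in the example above and could be passed to setSchemaStr.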