The Read CSV File component loads CSV files from Object Storage Service (OSS), HTTP servers, and Hadoop Distributed File System (HDFS) into a Machine Learning Platform for AI (PAI) pipeline.
Limitations
When configured in the PAI console, only MaxCompute, Realtime Compute for Apache Flink, and Deep Learning Containers (DLC) computing resources are supported.
To use the component from the PyAlink Script component, call it in code. For details, see PyAlink Script.
Prerequisites
Before you begin, ensure that you have:
(Optional) Authorized PAI to access OSS. Required only when you set fileSource to OSS. For details, see Grant the permissions that are required to use Machine Learning Designer.
Configured the Default Resource Preferred by Alink or FlinkML parameter on the Pipeline Attributes tab. The system uses this resource type automatically when running components.
Configure the Read CSV File component
Choose one of the following methods based on your workflow.
Method 1: PAI console
Configure the component on the Visualized Modeling (Designer) page.
Parameter setting tab
| Parameter | Description | Default |
|---|---|---|
| fileSource | Source of the CSV file. Valid values: OSS and OTHERS. Select OTHERS to use an HTTP server or HDFS path. | — |
| ossFilePath or filePath | Path of the CSV file. If fileSource is OSS, enter or select an OSS path. For files smaller than 1 GB, upload the CSV directly from the Select OSS directory or file page. If fileSource is OTHERS, enter an HTTP or HDFS path. | — |
| Schema | Data type for each column. Format: colname0 coltype0, colname1 coltype1, .... Example: f0 string,f1 bigint,f2 double. The data types must match the actual CSV column types. Do not use periods (.) in field names. | — |
| fieldDelimiter | Field delimiter. | , |
| quoteString | Quote character. | " |
| rowDelimiter | Row delimiter. | \n |
| ignoreFirstLine | Skip the first row. Turn on if the first row contains column headers. | Off |
| skipBlankLine | Skip blank rows. | — |
| handleInvalidMethod | Action taken when a tensor, vector, or MTable value fails to parse. These types are defined by the Alink algorithm framework and have a fixed parsing format. ERROR: stops reading. SKIP: skips the invalid value. | ERROR |
| lenient | Action taken when a row's data types or column count do not match the Schema. On: discard the row. Off: stop reading and display the error row. | Off |
Note: handleInvalidMethod and lenient control different failure modes. handleInvalidMethod applies to Alink-specific type parsing failures (tensor, vector, MTable). lenient applies to schema mismatches (wrong data type or column count). Configure both parameters to define a complete error-handling strategy.
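The schema-mismatch behavior described above can be illustrated with a minimal pure-Python sketch. It is not the component's implementation: `parse_schema`, `read_rows`, and the `CONVERTERS` mapping are hypothetical names introduced here, and only the `string`, `bigint`, and `double` types from the Schema example are handled. The sketch shows how a `lenient` switch might discard a mismatched row when on, or stop at the first one when off.

```python
import csv
import io

# Hypothetical mapping from schema type names to Python converters (illustration only).
CONVERTERS = {"string": str, "bigint": int, "double": float}


def parse_schema(schema_str):
    """Split 'f0 string,f1 bigint' into [('f0', 'string'), ('f1', 'bigint')]."""
    columns = []
    for part in schema_str.split(","):
        name, col_type = part.split()
        if "." in name:
            raise ValueError(f"field name must not contain a period: {name!r}")
        columns.append((name, col_type.lower()))
    return columns


def read_rows(text, schema_str, lenient=False):
    """Return the rows that match the schema; drop or reject the ones that do not."""
    columns = parse_schema(schema_str)
    rows = []
    for line_no, record in enumerate(csv.reader(io.StringIO(text)), start=1):
        try:
            if len(record) != len(columns):
                raise ValueError("column count mismatch")
            rows.append([CONVERTERS[t](v) for (_, t), v in zip(columns, record)])
        except ValueError as exc:
            if not lenient:
                # lenient off: stop reading and report the offending row
                raise ValueError(f"line {line_no}: {record!r}: {exc}") from exc
            # lenient on: discard the offending row and keep reading
    return rows


# Line 2 has a non-numeric bigint value; line 3 has too few columns.
sample = "a,1,1.5\nb,oops,2.0\nc,3\nd,4,4.5\n"
print(read_rows(sample, "f0 string,f1 bigint,f2 double", lenient=True))
# prints [['a', 1, 1.5], ['d', 4, 4.5]]
```

With `lenient=False`, the same call raises a `ValueError` that names line 2, which mirrors the "stop reading and display the error row" behavior.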
Execution tuning tab
| Parameter | Description | Default |
|---|---|---|
| Number of Workers | Number of compute nodes. Positive integer. Range: 1–9999. Must be set together with Memory per worker. | — |
| Memory per worker | Memory per node, in MB. Positive integer. Range: 1024–65536. | — |
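As a rough illustration of the constraints in this table, the following sketch checks a pair of values before they are entered. The function name `check_execution_tuning` is hypothetical and is not part of PAI. It only encodes the documented ranges and the rule that Number of Workers must be accompanied by Memory per worker.

```python
def check_execution_tuning(workers=None, memory_mb=None):
    """Validate Execution tuning values against the documented ranges."""
    if workers is not None and memory_mb is None:
        raise ValueError("Number of Workers must be set together with Memory per worker")
    if workers is not None and not (isinstance(workers, int) and 1 <= workers <= 9999):
        raise ValueError("Number of Workers must be an integer from 1 to 9999")
    if memory_mb is not None and not (isinstance(memory_mb, int) and 1024 <= memory_mb <= 65536):
        raise ValueError("Memory per worker must be an integer from 1024 to 65536 (MB)")
    return workers, memory_mb


print(check_execution_tuning(workers=4, memory_mb=8192))  # prints (4, 8192)
```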
Method 2: PyAlink Script component
Use CsvSourceBatchOp to load a CSV file in a PyAlink Script component.
The following example initializes the reader with CsvSourceBatchOp and chains .set*() methods to configure each parameter.
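```python
from pyalink.alink import *  # provides CsvSourceBatchOp and BatchOperator

filePath = 'https://alink-test-data.oss-cn-hangzhou.aliyuncs.com/iris.csv'
schema = 'sepal_length double, sepal_width double, petal_length double, petal_width double, category string'

# Configure the reader by chaining setters.
csvSource = CsvSourceBatchOp()\
    .setFilePath(filePath)\
    .setSchemaStr(schema)\
    .setFieldDelimiter(",")

# Run the read and collect the result as a pandas DataFrame.
BatchOperator.collectToDataframe(csvSource)
```

The following table lists all available parameters.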
| Parameter | Required | Default | Description |
|---|---|---|---|
| schemaStr | Yes | — | Data type for each column. Format: colname0 coltype0[, colname1 coltype1[, ...]]. Example: f0 string,f1 bigint,f2 double. |
| filePath | No | — | Path of the CSV file. |
| fieldDelimiter | No | , | Field delimiter. |
| quoteString | No | " | Quote character. |
| rowDelimiter | No | \n | Row delimiter. |
| ignoreFirstLine | No | False | Set to True if the first row contains column headers. |
| skipBlankLine | No | True | Skip blank rows. |
| handleInvalidMethod | No | ERROR | Action for unparseable tensor, vector, or MTable values. ERROR: stop reading. SKIP: skip the value. |
| lenient | No | False | True: discard rows that fail schema validation. False: return an error. |
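Because schemaStr must match the actual column types, it can help to draft the string from a small sample of the file before passing it to `setSchemaStr`. The sketch below is one possible approach and is not part of PyAlink: `infer_type` and `infer_schema_str` are hypothetical helpers, and the sketch assumes a header row, well-formed rows, and only the `bigint`, `double`, and `string` types used in the examples above. Periods in header names are replaced with underscores, since field names must not contain periods.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest of bigint, double, string that fits every value."""
    for type_name, caster in (("bigint", int), ("double", float)):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema_str(text, delimiter=","):
    """Build a 'name type, name type' string from a CSV sample with a header row."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    rows = [row for row in reader if row]  # ignore blank lines
    parts = []
    for index, name in enumerate(header):
        safe_name = name.strip().replace(".", "_")  # periods are not allowed
        parts.append(f"{safe_name} {infer_type([row[index] for row in rows])}")
    return ", ".join(parts)


sample = (
    "sepal.length,sepal_width,count,category\n"
    "5.1,3.5,2,Iris-setosa\n"
    "4.9,3,7,Iris-setosa\n"
)
print(infer_schema_str(sample))
# prints sepal_length double, sepal_width double, count bigint, category string
```

A schema drafted this way should still be reviewed by hand, because a small sample can hide values that appear later in the file. When the file has a header row, as assumed here, ignoreFirstLine also needs to be set to True.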