Read CSV File

The Read CSV File component loads CSV files from Object Storage Service (OSS), HTTP servers, and Hadoop Distributed File System (HDFS) into a Machine Learning Platform for AI (PAI) pipeline.

Limitations

  • When configured in the PAI console, only MaxCompute, Realtime Compute for Apache Flink, and Deep Learning Containers (DLC) computing resources are supported.

  • To use the component from the PyAlink Script component, call it in code. For details, see PyAlink Script.

Prerequisites

Before you begin, ensure that you have:

Configure the Read CSV File component

Choose one of the following methods based on your workflow.

Method 1: PAI console

Configure the component on the Visualized Modeling (Designer) page.

Parameter setting tab

| Parameter | Description | Default |
| --- | --- | --- |
| fileSource | Source of the CSV file. Valid values: OSS and OTHERS. Select OTHERS to use an HTTP server or HDFS path. | |
| ossFilePath or filePath | Path of the CSV file. If fileSource is OSS, enter or select an OSS path. For files smaller than 1 GB, upload the CSV directly from the Select OSS directory or file page. If fileSource is OTHERS, enter an HTTP or HDFS path. | |
| Schema | Data type for each column. Format: colname0 coltype0, colname1 coltype1, .... Example: f0 string,f1 bigint,f2 double. The data types must match the actual CSV column types. Do not use periods (.) in field names. | |
| fieldDelimiter | Field delimiter. | , |
| quoteString | Quote character. | " |
| rowDelimiter | Row delimiter. | \n |
| ignoreFirstLine | Skip the first row. Turn on if the first row contains column headers. | Off |
| skipBlankLine | Skip blank rows. | |
| handleInvalidMethod | Action taken when a tensor, vector, or MTable value fails to parse. These types are defined by the Alink algorithm framework and have a fixed parsing format. ERROR: stops reading. SKIP: skips the invalid value. | ERROR |
| lenient | Action taken when a row's data types or column count do not match the Schema. On: discard the row. Off: stop reading and display the error row. | Off |
Note: handleInvalidMethod and lenient control different failure modes. handleInvalidMethod applies to Alink-specific type parsing failures (tensor, vector, MTable). lenient applies to schema mismatches (wrong data type or column count). Configure both parameters to define a complete error-handling strategy.
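To try out the lenient behavior before running a pipeline, it can be emulated locally. The following is a minimal sketch that uses only Python's standard csv module. It is not the component's implementation: the read_rows helper, the CASTS mapping, and the sample data are all hypothetical, and it covers only the three types from the Schema example above.

```python
import csv
import io

# Hypothetical converters for the three types used in the Schema example.
CASTS = {"string": str, "bigint": int, "double": float}

def read_rows(text, schema, ignore_first_line=False, skip_blank_line=True, lenient=False):
    """Emulate schema checking locally; this is not the component's implementation."""
    casts = [CASTS[col_type] for _, col_type in schema]
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter=",", quotechar='"')
    for index, fields in enumerate(reader):
        if ignore_first_line and index == 0:
            continue  # header row
        if skip_blank_line and not fields:
            continue  # csv.reader yields [] for an empty line
        try:
            if len(fields) != len(casts):
                raise ValueError(f"expected {len(casts)} columns, got {len(fields)}")
            rows.append(tuple(cast(value) for cast, value in zip(casts, fields)))
        except ValueError as err:
            if lenient:
                continue  # lenient on: discard the offending row
            # lenient off: stop reading and report the error row
            raise ValueError(f"row {index}: {fields!r}: {err}") from err
    return rows

schema = [("f0", "string"), ("f1", "bigint"), ("f2", "double")]
data = 'f0,f1,f2\n"a,b",1,2.5\n\nc,not_a_number,3.0\nd,4\n'

# Keeps only the one row that matches the schema.
lenient_rows = read_rows(data, schema, ignore_first_line=True, lenient=True)
```

With lenient=False, the same call raises on the first row whose type or column count does not match, which mirrors the Off setting described in the table.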

Execution tuning tab

| Parameter | Description | Default |
| --- | --- | --- |
| Number of Workers | Number of compute nodes. Positive integer. Range: 1–9999. Must be set together with Memory per worker. | |
| Memory per worker | Memory per node, in MB. Positive integer. Range: 1024–65536. | |
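The constraints on these two settings can be checked before a pipeline is submitted. The sketch below is illustrative only: validate_tuning is a hypothetical helper, not part of PAI, and it assumes that leaving both values unset is allowed.

```python
def validate_tuning(workers=None, memory_mb=None):
    """Return a list of problems; an empty list means the pair looks valid."""
    problems = []
    # The two settings must be provided together (assumption: both may be left unset).
    if (workers is None) != (memory_mb is None):
        problems.append("Number of Workers and Memory per worker must be set together")
    if workers is not None and not (isinstance(workers, int) and 1 <= workers <= 9999):
        problems.append("Number of Workers must be an integer in 1-9999")
    if memory_mb is not None and not (isinstance(memory_mb, int) and 1024 <= memory_mb <= 65536):
        problems.append("Memory per worker must be an integer in 1024-65536 (MB)")
    return problems

ok = validate_tuning(workers=4, memory_mb=8192)   # both set and in range
bad = validate_tuning(workers=4)                  # memory is missing
```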

Method 2: PyAlink Script component

Use CsvSourceBatchOp to load a CSV file in a PyAlink Script component.

The following example initializes the reader with CsvSourceBatchOp and chains .set*() methods to configure each parameter.

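from pyalink.alink import *  # PyAlink API; may already be loaded in the PyAlink Script component

# HTTP path of the sample file and a schema that matches its columns.
filePath = 'https://alink-test-data.oss-cn-hangzhou.aliyuncs.com/iris.csv'
schema = 'sepal_length double, sepal_width double, petal_length double, petal_width double, category string'
# Configure the reader by chaining setters.
csvSource = CsvSourceBatchOp()\
    .setFilePath(filePath)\
    .setSchemaStr(schema)\
    .setFieldDelimiter(",")
# Run the job and return the result as a pandas DataFrame.
BatchOperator.collectToDataframe(csvSource)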

The following table lists all available parameters.

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| schemaStr | Yes | | Data type for each column. Format: colname0 coltype0[, colname1 coltype1[, ...]]. Example: f0 string,f1 bigint,f2 double. |
| filePath | No | | Path of the CSV file. |
| fieldDelimiter | No | , | Field delimiter. |
| quoteString | No | " | Quote character. |
| rowDelimiter | No | \n | Row delimiter. |
| ignoreFirstLine | No | False | Set to True if the first row contains column headers. |
| skipBlankLine | No | True | Skip blank rows. |
| handleInvalidMethod | No | ERROR | Action for unparseable tensor, vector, or MTable values. ERROR: stop reading. SKIP: skip the value. |
| lenient | No | False | True: discard rows that fail schema validation. False: return an error. |
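Because schemaStr is a plain string, a malformed entry is easy to miss until the job runs. The sketch below checks the colname coltype format locally and enforces the rule from the console parameter table that field names must not contain periods. parse_schema is a hypothetical helper, not part of PyAlink, and it assumes that type names contain no commas or spaces.

```python
def parse_schema(schema_str):
    """Split 'colname coltype, colname coltype, ...' into (name, type) pairs."""
    columns = []
    for part in schema_str.split(","):
        tokens = part.split()
        if len(tokens) != 2:
            raise ValueError(f"expected 'colname coltype', got {part.strip()!r}")
        name, col_type = tokens
        if "." in name:
            raise ValueError(f"field name {name!r} must not contain a period")
        columns.append((name, col_type))
    return columns

# The schema string from the example above.
iris = parse_schema(
    "sepal_length double, sepal_width double, petal_length double, "
    "petal_width double, category string"
)
```

Running a check like this before passing the string to setSchemaStr catches formatting mistakes early. It does not verify that the declared types match the file contents.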