Column operations, filtering, and transformation on two-dimensional datasets with CollectionExpr

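All operations on two-dimensional datasets in DataFrame belong to CollectionExpr, which can be thought of as a MaxCompute table or a spreadsheet. A DataFrame object is itself a special case of CollectionExpr. CollectionExpr provides a large set of operations on two-dimensional datasets, including column operations, filtering, and transformation.

Prerequisites

Complete the following steps before you run the examples in this topic:

Prepare the sample table pyodps_iris. For details, see DataFrame data processing.
Create a DataFrame. For details, see Create a DataFrame from a MaxCompute table.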

Get column types

Use dtypes to get the types of all columns in a CollectionExpr. dtypes returns a Schema object. Sample code:

print(iris.dtypes)

Output:

odps.Schema {
  sepallength           float64
  sepalwidth            float64
  petallength           float64
  petalwidth            float64
  name                  string
}

Select, add, and delete columns

Select columns

To select some of the columns of a CollectionExpr and produce a new dataset, use the expr[columns] syntax. Sample code:

print(iris['name', 'sepallength'].head(5))

Output:

          name  sepallength
0  Iris-setosa          4.9
1  Iris-setosa          4.7
2  Iris-setosa          4.6
3  Iris-setosa          5.0
4  Iris-setosa          5.4

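Note

To select a single column as a Collection, add a trailing comma after the column or explicitly wrap it in a list, for example iris[iris.sepallength, ] or iris[[iris.sepallength]]. Otherwise, a Sequence object is returned instead of a Collection.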
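The trailing comma matters because of how Python itself builds the subscript key: `obj[x, ]` passes a one-element tuple to `__getitem__`, while `obj[x]` passes the bare object. The sketch below uses a hypothetical `KeyProbe` class (not part of PyODPS) only to show the three key shapes a library can distinguish.

```python
class KeyProbe:
    """Hypothetical helper that reports what __getitem__ receives."""

    def __getitem__(self, key):
        # Return the type name of the key together with the key itself.
        return type(key).__name__, key


probe = KeyProbe()

single = probe["sepallength"]      # bare key: can be treated as "one column"
as_tuple = probe["sepallength", ]  # trailing comma: a one-element tuple
as_list = probe[["sepallength"]]   # explicit list

print(single)    # ('str', 'sepallength')
print(as_tuple)  # ('tuple', ('sepallength',))
print(as_list)   # ('list', ['sepallength'])
```

Because the bare key and the one-element tuple are different objects, a library is free to return a single column for the first form and a one-column dataset for the other two.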

Delete columns

To exclude some columns of an existing dataset from the new dataset, use the exclude method. Sample code:

print(iris.exclude('sepallength', 'petallength')[:5].head(5))

Output:

   sepalwidth  petalwidth         name
0         3.0         0.2  Iris-setosa
1         3.2         0.2  Iris-setosa
2         3.1         0.2  Iris-setosa
3         3.6         0.2  Iris-setosa
4         3.9         0.4  Iris-setosa

PyODPS 0.7.2 and later support another form that deletes the columns directly on the dataset. Sample code:

del iris['sepallength']
del iris['petallength']
print(iris[:5].head(5))

Output:

   sepalwidth  petalwidth         name
0         3.0         0.2  Iris-setosa
1         3.2         0.2  Iris-setosa
2         3.1         0.2  Iris-setosa
3         3.6         0.2  Iris-setosa
4         3.9         0.4  Iris-setosa

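Add columns

To add the result of a column transformation to an existing Collection, use the expr[expr, new_sequence] syntax. The new column becomes part of the new Collection.

Example: add 1 to the sepalwidth column of iris, rename the result to sepalwidthplus1, and append it to the end of the dataset to form a new dataset. Code:

print(iris[iris, (iris.sepalwidth + 1).rename('sepalwidthplus1')].head(5))

Output: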

   sepallength  sepalwidth  petallength  petalwidth         name  \
0          4.9         3.0          1.4         0.2  Iris-setosa   
1          4.7         3.2          1.3         0.2  Iris-setosa   
2          4.6         3.1          1.5         0.2  Iris-setosa   
3          5.0         3.6          1.4         0.2  Iris-setosa   
4          5.4         3.9          1.7         0.4  Iris-setosa   

   sepalwidthplus1  
0              4.0  
1              4.2  
2              4.1  
3              4.6  
4              4.9

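When you use the expr[expr, new_sequence] syntax, note that the transformed column may keep the same name as the original column. If you want to merge it with the original Collection, rename the column first. PyODPS 0.7.2 and later support appending a column directly to the current dataset. Sample code:

iris['sepalwidthplus1'] = iris.sepalwidth + 1
print(iris.head(5))

Output: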

   sepallength  sepalwidth  petallength  petalwidth         name  \
0          4.9         3.0          1.4         0.2  Iris-setosa   
1          4.7         3.2          1.3         0.2  Iris-setosa   
2          4.6         3.1          1.5         0.2  Iris-setosa   
3          5.0         3.6          1.4         0.2  Iris-setosa   
4          5.4         3.9          1.7         0.4  Iris-setosa  

   sepalwidthplus1  
0              4.0  
1              4.2  
2              4.1  
3              4.6  
4              4.9

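Add and delete columns at the same time

You can first exclude the original column with the exclude method and then merge in the transformed column, which avoids any name conflict. Sample code:

print(iris[iris.exclude('sepalwidth'), iris.sepalwidth * 2].head(5))

Output: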

   sepallength  petallength  petalwidth         name  sepalwidth
0          4.9          1.4         0.2  Iris-setosa         6.0
1          4.7          1.3         0.2  Iris-setosa         6.4
2          4.6          1.5         0.2  Iris-setosa         6.2
3          5.0          1.4         0.2  Iris-setosa         7.2
4          5.4          1.7         0.4  Iris-setosa         7.8

PyODPS 0.7.2 and later support overwriting a column directly on the current dataset. Sample code:

iris['sepalwidth'] = iris.sepalwidth * 2
print(iris.head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          4.9         6.0          1.4         0.2  Iris-setosa
1          4.7         6.4          1.3         0.2  Iris-setosa
2          4.6         6.2          1.5         0.2  Iris-setosa
3          5.0         7.2          1.4         0.2  Iris-setosa
4          5.4         7.8          1.7         0.4  Iris-setosa

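Another way to add and delete columns by creating a new Collection is to call select and pass the columns you want as arguments. To rename a column, pass it as a keyword argument, using the new column name as the argument name. Sample code:

print(iris.select('name', sepalwidthminus1=iris.sepalwidth - 1).head(5))

Output: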

          name  sepalwidthminus1
0  Iris-setosa               2.0
1  Iris-setosa               2.2
2  Iris-setosa               2.1
3  Iris-setosa               2.6
4  Iris-setosa               2.9

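You can also pass in a lambda expression, which takes the result of the previous step as its argument. At execution time, PyODPS checks these lambda expressions, passes in the Collection generated by the previous step, and replaces each lambda with the correct column. Sample code:
```
print(iris['name', 'petallength'][[lambda x: x.name]].head(5))
```
Output: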
```
          name
0  Iris-setosa
1  Iris-setosa
2  Iris-setosa
3  Iris-setosa
4  Iris-setosa
```
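To make the lambda resolution step more concrete, here is a minimal plain-Python model. `MiniCollection` and its `select` method are hypothetical stand-ins written for this illustration; they are not the PyODPS implementation and only mimic the idea that a lambda is called with the collection produced by the previous step.

```python
class MiniCollection:
    """Toy collection: a dict that maps column names to lists of values."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def __getattr__(self, name):
        # Called only when normal lookup fails; returns a (name, values)
        # pair that stands in for a single column.
        columns = self.__dict__.get("columns", {})
        if name in columns:
            return name, columns[name]
        raise AttributeError(name)

    def select(self, *fields):
        picked = {}
        for field in fields:
            if callable(field):
                # Resolve the lambda against this collection, that is,
                # the result of the previous step.
                field = field(self)
            if isinstance(field, str):
                field = (field, self.columns[field])
            name, values = field
            picked[name] = values
        return MiniCollection(picked)


iris = MiniCollection({
    "name": ["Iris-setosa", "Iris-setosa"],
    "sepallength": [4.9, 4.7],
    "petallength": [1.4, 1.3],
})

step1 = iris.select("name", "petallength")
step2 = step1.select(lambda x: x.name)
print(step2.columns)  # {'name': ['Iris-setosa', 'Iris-setosa']}
```

In this model the lambda sees only the columns that survived the previous step, so `step1.select(lambda x: x.sepallength)` raises `AttributeError`.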

PyODPS 0.7.2 and later support conditional assignment. Sample code:

iris[iris.sepallength > 5.0, 'sepalwidth'] = iris.sepalwidth * 2
print(iris.head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          4.9         3.0          1.4         0.2  Iris-setosa
1          4.7         3.2          1.3         0.2  Iris-setosa
2          4.6         3.1          1.5         0.2  Iris-setosa
3          5.0         3.6          1.4         0.2  Iris-setosa
4          5.4         7.8          1.7         0.4  Iris-setosa
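The row-wise meaning of conditional assignment can be sketched in plain Python. The `assign_where` helper below is hypothetical and works on a list of dicts rather than on a PyODPS Collection. It replaces the column value only in rows where the condition holds and keeps the old value elsewhere.

```python
# The first five rows of the sample data, reduced to the two columns involved.
rows = [
    {"sepallength": 4.9, "sepalwidth": 3.0},
    {"sepallength": 4.7, "sepalwidth": 3.2},
    {"sepallength": 4.6, "sepalwidth": 3.1},
    {"sepallength": 5.0, "sepalwidth": 3.6},
    {"sepallength": 5.4, "sepalwidth": 3.9},
]


def assign_where(rows, condition, column, new_value):
    """Return new rows in which `column` changes only where `condition` is true."""
    result = []
    for row in rows:
        updated = dict(row)  # copy so the input rows stay untouched
        if condition(row):
            updated[column] = new_value(row)
        result.append(updated)
    return result


updated = assign_where(
    rows,
    condition=lambda r: r["sepallength"] > 5.0,
    column="sepalwidth",
    new_value=lambda r: r["sepalwidth"] * 2,
)
print([r["sepalwidth"] for r in updated])  # [3.0, 3.2, 3.1, 3.6, 7.8]
```

Only the last row satisfies `sepallength > 5.0`, which matches the output table above. The row with `sepallength` equal to 5.0 is left unchanged because the comparison is strict.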

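Introduce constants and random numbers

Introduce constants

DataFrame supports appending a constant column to a Collection. To append a constant, use Scalar and specify the column name manually. Sample code:

from odps.df import Scalar
print(iris[iris, Scalar(1).rename('id')][:5].head(5))

Output: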

   sepallength  sepalwidth  petallength  petalwidth         name  id
0          4.9         3.0          1.4         0.2  Iris-setosa   1
1          4.7         3.2          1.3         0.2  Iris-setosa   1
2          4.6         3.1          1.5         0.2  Iris-setosa   1
3          5.0         3.6          1.4         0.2  Iris-setosa   1
4          5.4         3.9          1.7         0.4  Iris-setosa   1

To add a column of null values, use NullScalar, which requires the field type. Sample code:

from odps.df import NullScalar
print(iris[iris, NullScalar('float').rename('fid')][:5].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name   fid
0          4.9         3.0          1.4         0.2  Iris-setosa  None
1          4.7         3.2          1.3         0.2  Iris-setosa  None
2          4.6         3.1          1.5         0.2  Iris-setosa  None
3          5.0         3.6          1.4         0.2  Iris-setosa  None
4          5.4         3.9          1.7         0.4  Iris-setosa  None

PyODPS 0.7.12 and later provide a simplified form. Sample code:

iris['id'] = 1
print(iris.head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name  id
0          4.9         3.0          1.4         0.2  Iris-setosa   1
1          4.7         3.2          1.3         0.2  Iris-setosa   1
2          4.6         3.1          1.5         0.2  Iris-setosa   1
3          5.0         3.6          1.4         0.2  Iris-setosa   1
4          5.4         3.9          1.7         0.4  Iris-setosa   1

Note that this form cannot infer the type of a null value, so to add a null column you still need code like the following.

iris['null_col'] = NullScalar('float')
print(iris.head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name null_col
0          4.9         3.0          1.4         0.2  Iris-setosa     None
1          4.7         3.2          1.3         0.2  Iris-setosa     None
2          4.6         3.1          1.5         0.2  Iris-setosa     None
3          5.0         3.6          1.4         0.2  Iris-setosa     None
4          5.4         3.9          1.7         0.4  Iris-setosa     None

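Introduce random numbers

DataFrame also supports adding a column of random numbers to a Collection. The column type is FLOAT, the values range from 0 to 1, and each row gets a different value. To append a random column, use RandomScalar, whose optional argument is the random seed. Sample code: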

from odps.df import RandomScalar
print(iris[iris, RandomScalar().rename('rand_val')][:5].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name  rand_val
0          4.9         3.0          1.4         0.2  Iris-setosa  0.000471
1          4.7         3.2          1.3         0.2  Iris-setosa  0.799520
2          4.6         3.1          1.5         0.2  Iris-setosa  0.834609
3          5.0         3.6          1.4         0.2  Iris-setosa  0.106921
4          5.4         3.9          1.7         0.4  Iris-setosa  0.763442

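Filter data

Collection supports data filtering. You can filter data with AND (&), OR (|), NOT (~), the filter method, lambda expressions, and several other query styles.

Example 1: query rows where sepallength is greater than 5.

print(iris[iris.sepallength > 5].head(5))

Output: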

   sepallength  sepalwidth  petallength  petalwidth         name
0          5.4         3.9          1.7         0.4  Iris-setosa
1          5.4         3.7          1.5         0.2  Iris-setosa
2          5.8         4.0          1.2         0.2  Iris-setosa
3          5.7         4.4          1.5         0.4  Iris-setosa
4          5.4         3.9          1.3         0.4  Iris-setosa

Example 2: AND (&) condition.

print(iris[(iris.sepallength < 5) & (iris['petallength'] > 1.5)].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth             name
0          4.8         3.4          1.6         0.2      Iris-setosa
1          4.8         3.4          1.9         0.2      Iris-setosa
2          4.7         3.2          1.6         0.2      Iris-setosa
3          4.8         3.1          1.6         0.2      Iris-setosa
4          4.9         2.4          3.3         1.0  Iris-versicolor

Example 3: OR (|) condition.

print(iris[(iris.sepalwidth < 2.5) | (iris.sepalwidth > 4)].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth             name
0          5.7         4.4          1.5         0.4      Iris-setosa
1          5.2         4.1          1.5         0.1      Iris-setosa
2          5.5         4.2          1.4         0.2      Iris-setosa
3          4.5         2.3          1.3         0.3      Iris-setosa
4          5.5         2.3          4.0         1.3  Iris-versicolor

Note

AND and OR conditions must be written with & and |. The Python keywords and and or cannot be used.
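The reason lies in Python itself: `&`, `|`, and `~` can be overloaded through `__and__`, `__or__`, and `__invert__`, while `and` and `or` cannot be overloaded and instead ask the left operand for a single truth value. The sketch below uses a hypothetical `Cond` class to illustrate this. Raising in `__bool__` is an assumption made for the illustration, not a statement about how PyODPS reacts.

```python
class Cond:
    """Toy condition: wraps a per-row predicate and a readable description."""

    def __init__(self, predicate, text):
        self.predicate = predicate
        self.text = text

    def __and__(self, other):
        return Cond(lambda row: self.predicate(row) and other.predicate(row),
                    f"({self.text} & {other.text})")

    def __or__(self, other):
        return Cond(lambda row: self.predicate(row) or other.predicate(row),
                    f"({self.text} | {other.text})")

    def __invert__(self):
        return Cond(lambda row: not self.predicate(row), f"~{self.text}")

    def __bool__(self):
        # `and` / `or` call bool() on the left operand. A deferred condition
        # has no single truth value, so this toy class refuses.
        raise TypeError("combine conditions with & and |, not 'and' / 'or'")


narrow = Cond(lambda r: r["sepalwidth"] < 2.5, "sepalwidth < 2.5")
wide = Cond(lambda r: r["sepalwidth"] > 4, "sepalwidth > 4")
either = narrow | wide

rows = [{"sepalwidth": 4.4}, {"sepalwidth": 3.0}, {"sepalwidth": 2.3}]
kept = [r["sepalwidth"] for r in rows if either.predicate(r)]
print(either.text)  # (sepalwidth < 2.5 | sepalwidth > 4)
print(kept)         # [4.4, 2.3]

try:
    narrow or wide
except TypeError as exc:
    print(exc)
```

The parentheses around each comparison in the PyODPS examples are also required, because `&` and `|` bind more tightly than `<` and `>` in Python.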

Example 4: NOT (~) condition.

print(iris[~(iris.sepalwidth > 3)].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          4.9         3.0          1.4         0.2  Iris-setosa
1          4.4         2.9          1.4         0.2  Iris-setosa
2          4.8         3.0          1.4         0.1  Iris-setosa
3          4.3         3.0          1.1         0.1  Iris-setosa
4          5.0         3.0          1.6         0.2  Iris-setosa

Example 5: call the filter method explicitly with multiple conditions, which are combined with AND.

print(iris.filter(iris.sepalwidth > 3.5, iris.sepalwidth < 4).head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          5.0         3.6          1.4         0.2  Iris-setosa
1          5.4         3.9          1.7         0.4  Iris-setosa
2          5.4         3.7          1.5         0.2  Iris-setosa
3          5.4         3.9          1.3         0.4  Iris-setosa
4          5.7         3.8          1.7         0.3  Iris-setosa

Example 6: use a lambda expression for chained operations.

print(iris[iris.sepalwidth > 3.8]['name', lambda x: x.sepallength + 1].head(5))

Output:

          name  sepallength
0  Iris-setosa          6.4
1  Iris-setosa          6.8
2  Iris-setosa          6.7
3  Iris-setosa          6.4
4  Iris-setosa          6.2

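Example 7: if a Collection contains a column of the BOOLEAN type, you can use that column directly as a filter condition.

# Query the schema
print(df.dtypes)
# Output
odps.Schema {
  a boolean
  b int64
}

# Column a is of the boolean type, so this performs a filter
print(df[df.a])
# Output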
      a  b
0  True  1
1  True  3

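Therefore, when you index a Collection with a single Sequence, only a BOOLEAN column is valid, and the operation filters the Collection. The following forms show how column selection and filtering differ.

df[df.a, ]       # Selects the column.
df[[df.a]]       # Selects the column.
df.select(df.a)  # Selects the column explicitly.
df[df.a]         # a is a boolean column, so this filters.
df.a             # Gets a single column.
df['a']          # Gets a single column.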

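Example 8: use the query method, as in Pandas, to filter data with a query string. In the expression, refer to columns directly by name, such as sepallength. In a query string, & and and both mean AND, and | and or both mean OR.

print(iris.query("(sepallength < 5) and (petallength > 1.5)").head(5))

Output: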

   sepallength  sepalwidth  petallength  petalwidth             name
0          4.8         3.4          1.6         0.2      Iris-setosa
1          4.8         3.4          1.9         0.2      Iris-setosa
2          4.7         3.2          1.6         0.2      Iris-setosa
3          4.8         3.1          1.6         0.2      Iris-setosa
4          4.9         2.4          3.3         1.0  Iris-versicolor

To use a local variable in the expression, prefix the variable name with @.

var = 4
print(iris.query("(sepalwidth < 2.5) | (sepalwidth > @var)").head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth             name
0          5.7         4.4          1.5         0.4      Iris-setosa
1          5.2         4.1          1.5         0.1      Iris-setosa
2          5.5         4.2          1.4         0.2      Iris-setosa
3          4.5         2.3          1.3         0.3      Iris-setosa
4          5.5         2.3          4.0         1.3  Iris-versicolor

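The following table lists the syntax that query currently supports.

Syntax	Description
name	A name without the `@` prefix is treated as a column name. A name with the prefix refers to a local variable.
operator	Supports a subset of operators: `+`, `-`, `*`, `/`, `//`, `%`, `**`, `==`, `!=`, `<`, `<=`, `>`, `>=`, `in`, `not in`.
bool	AND, OR, and NOT operations. `&` and `and` mean AND. `|` and `or` mean OR.
attribute	Gets an attribute of an object.
index, slice, subscript	Indexing and slicing operations.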
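The name-resolution rule in the table (bare names are columns, `@` names are local variables) can be illustrated with a toy evaluator in plain Python. This is a sketch under simplifying assumptions: `query_rows` is a hypothetical helper that evaluates the string once per row with `eval`, whereas PyODPS parses the query into an expression that runs on the backend.

```python
import re


def query_rows(rows, expression, local_vars=None):
    """Toy evaluator: bare names are columns, @names are local variables."""
    local_vars = local_vars or {}
    # Rewrite @var to a reserved name so it cannot collide with a column name.
    rewritten = re.sub(r"@(\w+)", r"_local_\1", expression)
    code = compile(rewritten, "<query>", "eval")
    local_ns = {"_local_" + key: value for key, value in local_vars.items()}

    kept = []
    for row in rows:
        namespace = dict(row)       # column names resolve to this row's values
        namespace.update(local_ns)  # @names resolve to the caller's variables
        if eval(code, {"__builtins__": {}}, namespace):
            kept.append(row)
    return kept


rows = [
    {"sepallength": 5.7, "sepalwidth": 4.4, "petallength": 1.5},
    {"sepallength": 4.9, "sepalwidth": 3.0, "petallength": 1.4},
    {"sepallength": 4.5, "sepalwidth": 2.3, "petallength": 1.3},
    {"sepallength": 4.8, "sepalwidth": 3.4, "petallength": 1.6},
]

var = 4
by_width = query_rows(rows, "(sepalwidth < 2.5) | (sepalwidth > @var)", {"var": var})
by_length = query_rows(rows, "(sepallength < 5) and (petallength > 1.5)")

print([r["sepallength"] for r in by_width])   # [5.7, 4.5]
print([r["sepallength"] for r in by_length])  # [4.8]
```

Inside a query string both `|` and `or` work because the operands are plain per-row values here, which is why the restriction on `and` and `or` does not apply to query strings.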

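Parallel multi-row output

For columns of the LIST and MAP types, the explode method converts the column into multiple output rows. You can also use the apply method to produce multiple rows. To run aggregations and similar operations, you often need to combine these outputs with columns of the original table. For this, DataFrame provides parallel multi-row output: map the collection produced by the multi-row output function together with the columns of the original collection. Examples:
- Query the sample data:
```
print(df)
```
  Output: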
```
   id         a             b
0   1  [a1, b1]  [a2, b2, c2]
1   2      [c1]      [d2, e2]
```
- Example 1:
```
print(df[df.id, df.a.explode(), df.b])
```
  Output:
```
   id   a             b
0   1  a1  [a2, b2, c2]
1   1  b1  [a2, b2, c2]
2   2  c1      [d2, e2]
```
- Example 2:
```
print(df[df.id, df.a.explode(), df.b.explode()])
```
  Output:
```
   id   a   b
0   1  a1  a2
1   1  a1  b2
2   1  a1  c2
3   1  b1  a2
4   1  b1  b2
5   1  b1  c2
6   2  c1  d2
7   2  c1  e2
```
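If a multi-row output method produces no output for an input row, that input row is omitted from the final result by default. To keep the row in the result, set keep_nulls=True. The exploded value for that row is then output as null. Examples:
- Query the sample data:
```
print(df)
```
  Output: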
```
   id         a
0   1  [a1, b1]
1   2        []
```
- Example 1:
```
print(df[df.id, df.a.explode()])
```
  Output:
```
   id   a
0   1  a1
1   1  b1
```
- Example 2:
```
print(df[df.id, df.a.explode(keep_nulls=True)])
```
  Output:
```
   id     a
0   1    a1
1   1    b1
2   2  None
```
For more examples of parallel multi-row output with the explode method, see the topic on collection type operations.
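The row arithmetic behind the outputs above can be reproduced in plain Python. The sketch below is a model of the observed behavior only: `explode_rows` is a hypothetical helper working on a list of dicts, and it assumes that several exploded columns combine as a per-row Cartesian product, which is what the second example shows.

```python
from itertools import product


def explode_rows(rows, explode_columns, keep_nulls=False):
    """Expand list-valued columns into one output row per combination."""
    result = []
    for row in rows:
        expanded = []
        for column in explode_columns:
            values = list(row[column])
            if not values and keep_nulls:
                values = [None]  # keep the row, with a null in this column
            expanded.append(values)
        # An empty list makes the product empty, so the row is dropped.
        for combo in product(*expanded):
            new_row = dict(row)
            new_row.update(zip(explode_columns, combo))
            result.append(new_row)
    return result


df_rows = [
    {"id": 1, "a": ["a1", "b1"], "b": ["a2", "b2", "c2"]},
    {"id": 2, "a": ["c1"], "b": ["d2", "e2"]},
]
both = explode_rows(df_rows, ["a", "b"])
print(len(both))                             # 8
print([(r["a"], r["b"]) for r in both[:3]])  # [('a1', 'a2'), ('a1', 'b2'), ('a1', 'c2')]

sparse_rows = [{"id": 1, "a": ["a1", "b1"]}, {"id": 2, "a": []}]
print(explode_rows(sparse_rows, ["a"]))
# [{'id': 1, 'a': 'a1'}, {'id': 1, 'a': 'b1'}]
print(explode_rows(sparse_rows, ["a"], keep_nulls=True))
# [{'id': 1, 'a': 'a1'}, {'id': 1, 'a': 'b1'}, {'id': 2, 'a': None}]
```

Row 1 contributes 2 × 3 = 6 rows and row 2 contributes 1 × 2 = 2 rows, which gives the eight rows shown in the second example.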

Limit the number of rows

Output the first three rows.

print(iris[:3].execute())

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          4.9         3.0          1.4         0.2  Iris-setosa
1          4.7         3.2          1.3         0.2  Iris-setosa
2          4.6         3.1          1.5         0.2  Iris-setosa

For the MaxCompute SQL backend, slicing currently does not support start or step. The limit method is supported.

print(iris.limit(3).execute())

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          4.9         3.0          1.4         0.2  Iris-setosa
1          4.7         3.2          1.3         0.2  Iris-setosa
2          4.6         3.1          1.5         0.2  Iris-setosa

Note

Slicing can be applied only to a Collection, not to a Sequence.