Read partitioned table data with PyODPS

This topic shows how to use a PyODPS node in a DataWorks workspace to read data from a partitioned MaxCompute table.

Prerequisites

The following requirements must be met:

MaxCompute is activated.
DataWorks is activated.
You have a workflow in DataWorks. For more information, see Create a workflow.

Procedure

Note

This example uses a DataWorks workspace in standard mode. When you create the workspace, do not select Join Public Preview of DataStudio. Workspaces that join the public preview are not compatible with this example.

Prepare test data.
1. Create a table and upload data. For more information, see Create a table and upload data.
  The following are the table schemas and source data.
  - The table creation statement for the partitioned table user_detail is as follows.
```
CREATE TABLE IF NOT EXISTS user_detail
(
  userid    BIGINT COMMENT 'User ID',
  job       STRING COMMENT 'Job type',
  education STRING COMMENT 'Education'
) COMMENT 'User information table'
PARTITIONED BY (dt STRING COMMENT 'Date', region STRING COMMENT 'Region');
```
  - The statement to create the source table user_detail_ods is as follows.
```
CREATE TABLE IF NOT EXISTS user_detail_ods
(
  userid    BIGINT COMMENT 'User ID',
  job       STRING COMMENT 'Job type',
  education STRING COMMENT 'Education',
  dt STRING COMMENT 'Date',
  region STRING COMMENT 'Region'
);
```
  - Save the following test data as user_detail.txt, and then upload the file to the user_detail_ods table.
```
0001,Internet,Bachelor,20190715,beijing
0002,Education,Associate Degree,20190716,beijing
0003,Finance,Master,20190715,shandong
0004,Internet,Master,20190715,beijing
```
2. Write data from the source table user_detail_ods to the partitioned table user_detail.
  1. Log on to the DataWorks console.
  2. In the left-side navigation pane, click Workspace.
  3. Find the target workspace and in the Actions column, click Shortcuts > Data Development.
  4. Right-click the workflow and choose Create Node > ODPS SQL.
  5. Enter a node name and click OK.
  6. In the ODPS SQL node, enter the following code.
```
INSERT OVERWRITE TABLE user_detail PARTITION (dt, region) 
SELECT userid, job, education, dt, region FROM user_detail_ods;
```
  7. Click Run to write the data.
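
The INSERT statement above uses dynamic partitioning: each source row is written to the partition named by its own dt and region values. As a rough illustration only, the following plain-Python sketch groups the four test rows the same way to show which partitions should appear in user_detail after the node runs. It does not call MaxCompute, and the helper name `group_by_partition` is hypothetical.

```python
import csv
import io
from collections import OrderedDict

# Test data from user_detail.txt, copied from the preparation step.
TEST_DATA = """\
0001,Internet,Bachelor,20190715,beijing
0002,Education,Associate Degree,20190716,beijing
0003,Finance,Master,20190715,shandong
0004,Internet,Master,20190715,beijing
"""


def group_by_partition(text):
    """Group rows by (dt, region), mimicking a dynamic-partition insert."""
    partitions = OrderedDict()
    for userid, job, education, dt, region in csv.reader(io.StringIO(text)):
        # userid is BIGINT in the table, so it is converted to an integer here.
        row = (int(userid), job, education)
        partitions.setdefault((dt, region), []).append(row)
    return partitions


partitions = group_by_partition(TEST_DATA)
for (dt, region), rows in partitions.items():
    print("dt=%s,region=%s -> %d row(s)" % (dt, region, len(rows)))
```

The sketch yields three (dt, region) combinations, which matches the three partitions that the PyODPS node lists later in this topic.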

Use PyODPS to read data from the partitioned table.

Log on to the DataWorks console.
In the left-side navigation pane, click Workspace.
Find the target workspace and in the Actions column, click Shortcuts > Data Development.
On the Data Development page, right-click the workflow that you created and choose Create Node > PyODPS 2.
Enter a node name and click OK.

In the PyODPS 2 node, enter the following code.

```
import sys

# DataWorks provides the MaxCompute entry object `o` and the parameter
# dictionary `args` to PyODPS nodes, so neither is created here.

# Python 2 only: set the default system encoding to UTF-8.
reload(sys)
sys.setdefaultencoding('utf8')

print('dt=' + args['dt'])

# Get the table object.
t = o.get_table('user_detail')

# Check whether a specific partition exists.
print t.exist_partition('dt=20190715,region=beijing')

# List all partitions in the table.
for partition in t.partitions:
    print partition.name

# Query data by using one of the following three methods.

# Method 1: Use open_reader() as a context manager.
# The reader is closed automatically when the 'with' block exits,
# so all reads from it must happen inside the block.
print("Query data from the partitioned table by using Method 1:")
with t.open_reader(partition='dt=20190715,region=beijing') as reader1:
    count = reader1.count  # Total number of records in the partition.
    for record in reader1:
        print record[0], record[1], record[2]

# Method 2: Use open_reader() without a context manager.
# This example accesses record fields by column name.
print("Query data from the partitioned table by using Method 2:")
reader2 = t.open_reader(partition='dt=20190715,region=beijing')
for record in reader2:
    print record["userid"], record["job"], record["education"]

# Method 3: Use read_table() on the ODPS object.
# This is the most concise option for simple read operations.
print("Query data from the partitioned table by using Method 3:")
for record in o.read_table('user_detail', partition='dt=20190715,region=beijing'):
    print record["userid"], record["job"], record["education"]
```
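
Method 1 is the easiest pattern to get right as long as every read happens while the reader is still open. The sketch below wraps that pattern in a small function that uses syntax valid in both Python 2 and Python 3. `read_partition` is a hypothetical helper, not part of PyODPS; it only assumes that `open_reader(partition=...)` returns a context manager whose records can be indexed by column name, as in the node code above. The `_FakeTable` and `_FakeReader` classes are stand-ins so that the sketch can run outside DataWorks; in a node you would pass `o.get_table('user_detail')` instead.

```python
def read_partition(table, partition, columns):
    """Return the rows of one partition as a list of tuples.

    Every record is read inside the 'with' block, so nothing touches
    the reader after it has been closed.
    """
    rows = []
    with table.open_reader(partition=partition) as reader:
        for record in reader:
            rows.append(tuple(record[name] for name in columns))
    return rows


# --- Stand-ins for local experimentation only (not PyODPS classes) ---
class _FakeReader(object):
    def __init__(self, records):
        self._records = records
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.closed = True
        return False  # Do not swallow exceptions.

    def __iter__(self):
        return iter(self._records)


class _FakeTable(object):
    def __init__(self, data):
        self._data = data  # Maps a partition spec to a list of dict records.

    def open_reader(self, partition=None):
        return _FakeReader(self._data[partition])


fake_table = _FakeTable({
    'dt=20190715,region=beijing': [
        {'userid': 1, 'job': 'Internet', 'education': 'Bachelor'},
        {'userid': 4, 'job': 'Internet', 'education': 'Master'},
    ],
})

rows = read_partition(fake_table, 'dt=20190715,region=beijing',
                      ['userid', 'job', 'education'])
for row in rows:
    print(row)
```

Returning a plain list keeps the caller independent of the reader's lifetime, at the cost of holding the whole partition in memory, so this shape is only reasonable for small partitions such as the test data here.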

Click Run with Parameters.
In the Parameters dialog box, configure the following parameters and click Run.
- Resource Group Name: Select Shared Resource Group.
- dt: Set the parameter to dt=20190715. The script reads the value 20190715 as args['dt'].
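
The dt parameter is available inside the node as `args['dt']`, which the example script only prints. One possible refinement, sketched below, is to build the partition spec from that value rather than hard-coding the date. The `args` dictionary here is a local stand-in for the one that DataWorks provides, and `build_partition_spec` is a hypothetical helper.

```python
def build_partition_spec(dt, region):
    """Build a partition spec string such as 'dt=20190715,region=beijing'."""
    return 'dt=%s,region=%s' % (dt, region)


# Stand-in for the args dictionary that DataWorks provides to the node.
args = {'dt': '20190715'}

spec = build_partition_spec(args['dt'], 'beijing')
print(spec)
```

In the node, the resulting string could be passed to `t.exist_partition(spec)` or `t.open_reader(partition=spec)`.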

View the run results on the Runtime Log tab.

```
Executing user script with PyODPS 0.8.0
dt=20190715
True
dt='20190715',region='beijing'
dt='20190715',region='shandong'
dt='20190716',region='beijing'
Query data from the partitioned table by using Method 1:
4 Internet Master
1 Internet Bachelor
Query data from the partitioned table by using Method 2:
4 Internet Master
1 Internet Bachelor
Query data from the partitioned table by using Method 3:
4 Internet Master
1 Internet Bachelor
```
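
Note that the partition names in the log quote their values (dt='20190715',region='beijing'), whereas the spec strings passed to exist_partition() and open_reader() in the script do not. If you need the individual values, a small parser along the lines of the following sketch may help. `parse_partition_name` is a hypothetical helper, and it assumes that partition values contain no commas or equals signs.

```python
def parse_partition_name(name):
    """Split a name such as "dt='20190715',region='beijing'" into a dict."""
    result = {}
    for part in name.split(','):
        key, value = part.split('=', 1)
        # Drop surrounding whitespace and any single or double quotes.
        result[key.strip()] = value.strip().strip("'\"")
    return result


parsed = parse_partition_name("dt='20190715',region='beijing'")
print(parsed['dt'], parsed['region'])
```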