Read partitioned table data with PyODPS

更新时间:
复制 MD 格式

This topic shows you how to read data from a partitioned table with PyODPS.

Prerequisites

Ensure the following requirements are met:

Procedure

Note

This example uses a DataWorks workspace in standard mode. When you create the workspace, do not select Join Public Preview of DataStudio . Workspaces in public preview are not compatible with this example.

  1. Prepare the test data.

    1. Create tables and upload data. For instructions, see Create tables and upload data.

      The table schema and source data are as follows.

      • The following DDL statement creates the partitioned table user_detail.

        CREATE TABLE IF NOT EXISTS user_detail
        (
        userid    BIGINT COMMENT 'user ID',
        job       STRING COMMENT 'job type',
        education STRING COMMENT 'education level'
        ) COMMENT 'User information table'
        PARTITIONED BY (dt STRING COMMENT 'date',region STRING COMMENT 'region');
      • The following DDL statement creates the source data table user_detail_ods.

        CREATE TABLE IF NOT EXISTS user_detail_ods
        (
          userid    BIGINT COMMENT 'user ID',
          job       STRING COMMENT 'job type',
          education STRING COMMENT 'education level',
          dt STRING COMMENT 'date',
          region STRING COMMENT 'region'
        );
      • Save the following test data as a user_detail.txt file and upload the file to the user_detail_ods table.

        0001,Internet,Bachelor,20190715,beijing
        0002,Education,junior college,20190716,beijing
        0003,Finance,Master,20190715,shandong
        0004,Internet,Master,20190715,beijing
    2. Write the data from the source data table user_detail_ods to the partitioned table user_detail.

      1. Log on to the DataWorks console.

      2. In the left-side navigation pane, click Workspaces.

      3. Find the target workspace. In the Actions column, go to Shortcuts > Data Development.

      4. Right-click the workflow and choose Create Node > ODPS SQL.

      5. Enter a name for the node and click Confirm.

      6. In the editor for the ODPS SQL node, enter the following code.

        INSERT OVERWRITE TABLE user_detail PARTITION (dt, region) 
        SELECT userid, job, education, dt, region FROM user_detail_ods;
      7. Click Run.

  2. Use PyODPS to read data from the partitioned table.

    1. Log on to the DataWorks console.

    2. In the left-side navigation pane, click Workspaces.

    3. Find the target workspace. In the Actions column, go to Shortcuts > Data Development.

    4. On the Data Development page, right-click the created workflow and choose Create Node > PyODPS 2.

    5. Enter a name for the node and click Confirm.

    6. In the PyODPS 2 node, enter the following code.

      import sys
      from odps import ODPS
      reload(sys)
      print('dt=' + args['dt'])
      # Set the default system encoding to UTF-8.
      sys.setdefaultencoding('utf8')
      # Get the table object.
      t = o.get_table('user_detail')
      # Check whether a specific partition exists.
      print t.exist_partition('dt=20190715,region=beijing')
      # List all partitions in the table.
      for partition in t.partitions:
          print partition.name
      # Query data by using one of the following three methods.
      # Method 1: Use open_reader() as a context manager. 
      # The reader is automatically closed when the 'with' block is exited, which ensures proper resource cleanup.
      with t.open_reader(partition='dt=20190715,region=beijing') as reader1:
          count = reader1.count
      print("Query data from the partitioned table by using Method 1:")
      for record in reader1:
          print record[0],record[1],record[2]
      # Method 2: Use open_reader() without a context manager.
      # This method allows you to access records by column name.
      print("Query data from the partitioned table by using Method 2:")
      reader2 = t.open_reader(partition='dt=20190715,region=beijing')
      for record in reader2:
          print record["userid"],record["job"],record["education"]
      # Method 3: Use read_table() on the ODPS object.
      # This is the most concise option for simple read operations.
      print("Query data from the partitioned table by using Method 3:")
      for record in o.read_table('user_detail', partition='dt=20190715,region=beijing'):
          print record["userid"],record["job"],record["education"]
    7. Click Run with Parameters.

    8. In the Parameters dialog box, set the parameters and click Run.

      The parameters are as follows:

      • Resource Group Name: Select Default Resource Group.

      • dt: Enter 20190715.

      image

    9. View the run results on the Runtime Log tab.

      Executing user script with PyODPS 0.8.0
      dt=20190715
      True
      dt='20190715',region='beijing'
      dt='20190715',region='shandong'
      dt='20190716',region='beijing'
      Query data from the partitioned table by using Method 1:
      4 Internet master
      1 Internet bachelor
      Query data from the partitioned table by using Method 2:
      4 Internet master
      1 Internet bachelor
      Query data from the partitioned table by using Method 3:
      4 Internet master
      1 Internet bachelor