Analyze data lakes with OSS foreign tables

更新时间:
复制 MD 格式

Use OSS foreign tables to import and analyze data from Object Storage Service (OSS). Based on the OSS foreign data wrapper (FDW), these tables also support cross-account data import.

Limitations

The AnalyticDB for PostgreSQL instance and the OSS bucket must be in the same region.

Features

OSS FDW is based on the PostgreSQL Foreign Data Wrapper (FDW) framework. It lets you:

  • Import data from OSS into local row-oriented or column-oriented tables to accelerate analysis.

  • Directly query and analyze large volumes of data in OSS.

  • Join OSS foreign tables with local tables for analysis.

OSS FDW supports various data file formats, including:

  • Uncompressed text files in CSV, TEXT, JSON, and JSON Lines formats.

  • GZIP-compressed and Snappy-compressed text files in CSV and TEXT formats.

  • GZIP-compressed text files in JSON and JSON Lines formats.

  • ORC binary files. For data type mappings between ORC and AnalyticDB for PostgreSQL, see Data type mappings for ORC files.

  • Parquet binary files. For data type mappings between Parquet and AnalyticDB for PostgreSQL, see Data type mappings for Parquet files.

  • Avro binary files. For data type mappings between Avro and AnalyticDB for PostgreSQL, see Data type mappings for Avro files.

Before you begin

Prepare OSS data

Prepare the sample file example.csv.

Find OSS bucket information

This section describes how to find the Buckets, Bucket Name, Port, and Bucket Domain Name.

  1. Log in to the OSS console.

  2. In the left navigation pane, click Access from ECS over the VPC (internal network).

  3. On the Endpoint page, click the target bucket.

    On the Object Management page, you can find the bucket name.

  4. On the Object Management page, find the object path.

  5. In the left navigation pane, click Overview.

  6. In the Port section on the Overview page, find the endpoint and bucket domain name.

    We recommend using the endpoint for Access from ECS over the VPC (internal network).

Get an AccessKey ID and AccessKey Secret

To get an AccessKey ID and AccessKey Secret, see Create an AccessKey.

Create an OSS server

Use the CREATE SERVER statement to create an OSS server. This server defines the connection to the OSS service that you want to access. For more information about CREATE SERVER, see CREATE SERVER.

Syntax

CREATE SERVER server_name
    FOREIGN DATA WRAPPER fdw_name
    [ OPTIONS ( option 'value' [, ... ] ) ]

Parameters

Parameter

Type

Required

Description

server_name

STRING

Yes

The name of the OSS server.

fdw_name

STRING

Yes

The name of the foreign data wrapper that manages the server. This parameter must be set to oss_fdw.

The following table describes the parameters for the OPTIONS clause.

Parameter

Type

Required

Description

endpoint

STRING

Yes

The endpoint used to access OSS.AnalyticDB for PostgreSQL supports only internal endpoints. For more information, see the public cloud section in Regions and endpoints.

bucket

STRING

No

The name of the bucket where the data files are stored. For details on how to obtain the bucket name, see Preparations.

Note
  • You must specify the bucket option for either the OSS server or the OSS FDW. For more information about this option when used with an OSS FDW, see Create an OSS FDW.

  • If the bucket option is specified for both the OSS server and the OSS FDW, the option specified for the OSS FDW takes precedence.

speed_limit

NUMERIC

No

The minimum amount of data to transfer within the period specified by speed_time to prevent a timeout. Unit: bytes. The default value is 1024.

Use this option with the speed_time option.

Note

By default, a timeout is triggered if less than 1024 bytes of data is transferred in 90 consecutive seconds. For more information, see OSS SDK Error Handling.

speed_time

NUMERIC

No

The time period within which the minimum data transfer amount specified by speed_limit must be met. Unit: seconds. The default value is 90.

Use this option with the speed_limit option.

Note

By default, a timeout is triggered if less than 1024 bytes of data is transferred in 90 consecutive seconds. For more information, see OSS SDK Error Handling.

connect_timeout

NUMERIC

No

The connection timeout period. Unit: seconds. The default value is 10.

dns_cache_timeout

NUMERIC

No

The DNS cache timeout period. Unit: seconds. The default value is 60.

Examples

CREATE SERVER oss_serv
    FOREIGN DATA WRAPPER oss_fdw
    OPTIONS (
        endpoint 'oss-cn-********.aliyuncs.com',
        bucket 'adb-pg'
  );

You can also use the ALTER SERVER statement to modify the configuration of the OSS server. For more information, see ALTER SERVER.

Examples:

  • Modify an option for the OSS server.

    ALTER SERVER oss_serv OPTIONS(SET endpoint 'oss-cn-********.aliyuncs.com');
  • Add an option to the OSS server.

    ALTER SERVER oss_serv OPTIONS(ADD connect_timeout '20');
  • Drop an option from the OSS server.

    ALTER SERVER oss_serv OPTIONS(DROP connect_timeout);

To delete the OSS server, use the DROP SERVER statement. For more information, see DROP SERVER.

Create an OSS user mapping

After creating an OSS server, you must also create a user mapping. The CREATE USER MAPPING statement defines a mapping between a user in your AnalyticDB for PostgreSQL database and the credentials to access the OSS server. For more information, see CREATE USER MAPPING.

Syntax

CREATE USER MAPPING FOR { username | USER | CURRENT_USER | PUBLIC }
    SERVER <server_name>
    [ OPTIONS ( option 'value' [, ... ] ) ]

Parameters

Parameter

Type

Required

Description

username

string

Yes. You must specify one of the four options.

The name of an existing user to map in the AnalyticDB for PostgreSQL instance.

USER

string

Maps the current user of the AnalyticDB for PostgreSQL instance.

CURRENT_USER

string

PUBLIC

string

Creates a public mapping that applies to all users of the AnalyticDB for PostgreSQL instance, including those created later.

server_name

string

Yes

The name of the OSS server.

The following table describes the parameters for the OPTIONS clause.

Parameter

Type

Required

Description

id

string

Yes

The AccessKey ID. To obtain one, see Create an AccessKey pair.

key

string

Yes

The AccessKey secret. To obtain one, see Create an AccessKey pair.

Note

When you import or export data across Alibaba Cloud accounts, you must configure the AccessKey ID and AccessKey Secret of the Alibaba Cloud account that owns the OSS bucket.

Example

CREATE USER MAPPING FOR PUBLIC
    SERVER oss_serv
    OPTIONS (
        id 'LTAI****************',
        key 'yourAccessKeySecret'
    );

You can also use the DROP USER MAPPING statement to delete a user mapping. For more information, see DROP USER MAPPING.

Create an OSS foreign table

After you have an OSS server and a user that can access it, use the CREATE FOREIGN TABLE statement to create an OSS foreign table. For more information, see CREATE FOREIGN TABLE.

Syntax

CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name ( [
    column_name data_type [ OPTIONS ( option 'value' [, ... ] ) ] [ COLLATE collation ] [ column_constraint [ ... ] ]
      [, ... ]
] )
    SERVER server_name
  [ OPTIONS ( option 'value' [, ... ] ) ]

Parameters

Parameter

Type

Required

Description

table_name

string

Yes

The name of the OSS foreign table.

column_name

string

Yes

The column name.

data_type

string

Yes

The data type of the column.

The following table lists the parameters for the OPTIONS clause.

Parameter

Type

Required

Description

filepath

string

Yes. One of these three parameters is required.

The full path and name of a single object in OSS.

If you use this parameter, only the specified object is selected.

prefix

string

Specifies a prefix for object paths. Only objects whose paths start with this prefix are selected. Regular expressions are not supported.

Examples:

  • If you set prefix to test/filename, the following objects are imported:

    • test/filename

    • test/filenamexxx

    • test/filename/aa

    • test/filenameyyy/aa

    • test/filenameyyy/bb/aa

  • If you set prefix to test/filename/, only the following object from the preceding list is imported:

    • test/filename/aa

dir

string

The directory path in OSS must end with /, for example, test/mydir/.

Selects all objects in the specified directory, excluding objects in its subdirectories.

bucket

string

No

The name of the bucket that contains the data objects. See Preparations to learn how to obtain the bucket name.

Note
  • You must specify the bucket parameter for either the OSS server or the OSS foreign table.

  • If the bucket parameter is specified for both the OSS server and the OSS foreign table, the value for the OSS foreign table takes precedence.

format

string

Yes

The object format. Valid values:

  • csv

  • text

  • orc

  • avro

  • parquet

  • json

    For more information about the JSON specification, see JSON specification.

  • jsonline

    Represents JSON data where each line is a valid JSON object. All data readable with jsonline is also readable with json, but not vice versa. We recommend that you use jsonline when possible. For more information about the JSON Lines specification, see JSONLINE specification.

filetype

string

No

The compression type of the object. Valid values:

  • plain (default): Reads raw binary data without extra processing.

  • gzip: Reads raw binary data and decompresses it by using GZIP.

  • snappy: Reads raw binary data and decompresses it by using Snappy.

    Only standard Snappy compression is supported. Hadoop-Snappy compressed objects are not supported.

Note
  • This parameter applies only to objects in csv, text, json, and jsonline formats.

  • The snappy option does not support objects in json and jsonline formats.

log_errors

boolean

No

Specifies whether to record errors in a log file. The default value is false. For more information, see Fault tolerance.

Note

This parameter applies only to objects in csv and text formats.

segment_reject_limit

numeric

No

The error threshold that causes the data loading task to abort.

If the value includes a percent sign (%), it represents a percentage of rows. Otherwise, it represents an absolute number of rows. Examples:

  • segment_reject_limit = '10': The task stops and reports an error if the number of error rows exceeds 10.

  • segment_reject_limit = '10%': The task stops and reports an error if the number of error rows exceeds 10% of the total processed rows.

Note

This parameter applies only to objects in csv and text formats.

header

boolean

No

Specifies whether the source object contains a header row. Valid values:

  • true: The object contains a header row.

  • false (default): The object does not contain a header row.

Note

This parameter applies only to objects in csv format.

delimiter

string

No

The delimiter that separates fields. It must be a single-byte character.

  • For csv objects, the default is a comma (,).

  • For text objects, the default is a tab character.

Note

This parameter applies only to objects in csv and text formats.

quote

string

No

The quotation character for fields. It must be a single-byte character. The default is a double quotation mark (").

Note

This parameter applies only to objects in csv format.

escape

string

No

The escape character that precedes a character that has a special meaning. It must be a single-byte character. The default is a double quotation mark (").

Note

This parameter applies only to objects in csv format.

null

string

No

Specifies the string that represents a NULL value in the data object.

  • For csv format, the default is \N.

  • For text format, the default is an unquoted empty string.

Note

This parameter applies only to objects in csv and text formats.

encoding

string

No

Specifies the encoding of the data objects. Defaults to the client encoding.

Note

This parameter applies only to objects in csv and text formats.

force_not_null

boolean

No

Specifies whether field values can be empty strings. Valid values:

  • true: Field values cannot be empty strings.

  • false (default): Field values can be empty strings.

Note

This parameter applies only to objects in csv and text formats.

force_null

boolean

No

Specifies how to handle empty strings. Valid values:

  • true: Treats all empty strings as NULL, regardless of whether they are quoted.

  • false (default): Treats only unquoted empty strings as NULL.

Note

This parameter applies only to objects in csv and text formats.

Example

CREATE FOREIGN TABLE ossexample (
    date text,
    time text,
    open float,
    high float,
    low float,
    volume int
) SERVER oss_serv OPTIONS (dir 'dir_oss_adb/', format 'csv');
Note

After you create an OSS foreign table, you can use one of the following methods to check which OSS objects the table maps to:

  • Method 1:

    EXPLAIN VERBOSE SELECT * FROM <OSS foreign table name>;
  • Method 2:

    SELECT * FROM get_oss_table_meta('<OSS foreign table name>');

You can also use the DROP FOREIGN TABLE statement to drop the OSS foreign table. For more information, see DROP FOREIGN TABLE.

Query and analyze OSS data

You can query an OSS foreign table the same way you query a local table. The following examples show common queries.

  • Query data using a key-value filter.

    SELECT * FROM ossexample WHERE volume = 5;
  • Query data using an aggregate function.

    SELECT count(*) FROM ossexample WHERE volume = 5;
  • Query data using the GROUP BY and LIMIT clauses.

    SELECT low, sum(volume)
      FROM ossexample
     GROUP BY low
     ORDER BY low
     limit 5;

Join analysis with OSS foreign tables

  1. Create a local table named example and insert test data.

    CREATE TABLE example (id int, volume int);
    INSERT INTO example VALUES(1,1), (2,3), (4,5);
  2. Run a join query on the example and ossexample tables.

    SELECT example.volume, min(high), max(low)
    FROM
    ossexample,
    example
    WHERE ossexample.volume = example.volume
    GROUP BY(example.volume)
    ORDER BY example.volume;

Fault tolerance

OSS FDW provides fault tolerance using the log_errors and segment_reject_limit parameters. This prevents errors in the raw data from interrupting scans of an OSS foreign table.

For more information about the log_errors and segment_reject_limit parameters, see Create an OSS foreign table.

  • Create a fault-tolerant OSS foreign table.

    CREATE FOREIGN TABLE oss_error_sales (id int, value float8, x text)
        SERVER oss_serv
        OPTIONS (log_errors 'true', -- Enable error logging.
                 segment_reject_limit '10', -- Set the reject limit to 10 rows. The scan is aborted if this limit is exceeded.
                 dir 'error_sales/', -- Specify the OSS object directory for the foreign table.
                 format 'csv', -- Specify the object format as CSV.
                 encoding 'utf8'); -- Specify the encoding. 
  • Query the error log.

    SELECT * FROM gp_read_error_log('oss_error_sales');
  • Clear the error log.

    SELECT gp_truncate_error_log('oss_error_sales');

FAQ

Q: Does deleting data from an OSS foreign table also delete the underlying data in OSS?

A: No. Deleting data from an OSS foreign table does not delete the underlying data in OSS.

Related documentation