Analyze data lakes with OSS foreign tables
Use OSS foreign tables to import and analyze data from Object Storage Service (OSS). Based on the OSS foreign data wrapper (FDW), these tables also support cross-account data import.
Limitations
The AnalyticDB for PostgreSQL instance and the OSS bucket must be in the same region.
Features
OSS FDW is based on the PostgreSQL Foreign Data Wrapper (FDW) framework. It lets you:
-
Import data from OSS into local row-oriented or column-oriented tables to accelerate analysis.
-
Directly query and analyze large volumes of data in OSS.
-
Join OSS foreign tables with local tables for analysis.
OSS FDW supports various data file formats, including:
-
Uncompressed text files in CSV, TEXT, JSON, and JSON Lines formats.
-
GZIP-compressed and Snappy-compressed text files in CSV and TEXT formats.
-
GZIP-compressed text files in JSON and JSON Lines formats.
-
ORC binary files. For data type mappings between ORC and AnalyticDB for PostgreSQL, see Data type mappings for ORC files.
-
Parquet binary files. For data type mappings between Parquet and AnalyticDB for PostgreSQL, see Data type mappings for Parquet files.
-
Avro binary files. For data type mappings between Avro and AnalyticDB for PostgreSQL, see Data type mappings for Avro files.
Before you begin
Prepare OSS data
Prepare the sample file example.csv.
Find OSS bucket information
This section describes how to find the Buckets, Bucket Name, Port, and Bucket Domain Name.
-
Log in to the OSS console.
-
In the left navigation pane, click Access from ECS over the VPC (internal network).
-
On the Endpoint page, click the target bucket.
On the Object Management page, you can find the bucket name.
-
On the Object Management page, find the object path.
-
In the left navigation pane, click Overview.
-
In the Port section on the Overview page, find the endpoint and bucket domain name.
We recommend using the endpoint for Access from ECS over the VPC (internal network).
Get an AccessKey ID and AccessKey Secret
To get an AccessKey ID and AccessKey Secret, see Create an AccessKey.
Create an OSS server
Use the CREATE SERVER statement to create an OSS server. This server defines the connection to the OSS service that you want to access. For more information about CREATE SERVER, see CREATE SERVER.
Syntax
CREATE SERVER server_name
FOREIGN DATA WRAPPER fdw_name
[ OPTIONS ( option 'value' [, ... ] ) ]
Parameters
|
Parameter |
Type |
Required |
Description |
|
server_name |
STRING |
Yes |
The name of the OSS server. |
|
fdw_name |
STRING |
Yes |
The name of the foreign data wrapper that manages the server. This parameter must be set to oss_fdw. |
The following table describes the parameters for the OPTIONS clause.
|
Parameter |
Type |
Required |
Description |
|
endpoint |
STRING |
Yes |
The endpoint used to access OSS.AnalyticDB for PostgreSQL supports only internal endpoints. For more information, see the public cloud section in Regions and endpoints. |
|
bucket |
STRING |
No |
The name of the bucket where the data files are stored. For details on how to obtain the bucket name, see Preparations. Note
|
|
speed_limit |
NUMERIC |
No |
The minimum amount of data to transfer within the period specified by speed_time to prevent a timeout. Unit: bytes. The default value is 1024. Use this option with the speed_time option. Note
By default, a timeout is triggered if less than 1024 bytes of data is transferred in 90 consecutive seconds. For more information, see OSS SDK Error Handling. |
|
speed_time |
NUMERIC |
No |
The time period within which the minimum data transfer amount specified by speed_limit must be met. Unit: seconds. The default value is 90. Use this option with the speed_limit option. Note
By default, a timeout is triggered if less than 1024 bytes of data is transferred in 90 consecutive seconds. For more information, see OSS SDK Error Handling. |
|
connect_timeout |
NUMERIC |
No |
The connection timeout period. Unit: seconds. The default value is 10. |
|
dns_cache_timeout |
NUMERIC |
No |
The DNS cache timeout period. Unit: seconds. The default value is 60. |
Examples
CREATE SERVER oss_serv
FOREIGN DATA WRAPPER oss_fdw
OPTIONS (
endpoint 'oss-cn-********.aliyuncs.com',
bucket 'adb-pg'
);
You can also use the ALTER SERVER statement to modify the configuration of the OSS server. For more information, see ALTER SERVER.
Examples:
-
Modify an option for the OSS server.
ALTER SERVER oss_serv OPTIONS(SET endpoint 'oss-cn-********.aliyuncs.com'); -
Add an option to the OSS server.
ALTER SERVER oss_serv OPTIONS(ADD connect_timeout '20'); -
Drop an option from the OSS server.
ALTER SERVER oss_serv OPTIONS(DROP connect_timeout);
To delete the OSS server, use the DROP SERVER statement. For more information, see DROP SERVER.
Create an OSS user mapping
After creating an OSS server, you must also create a user mapping. The CREATE USER MAPPING statement defines a mapping between a user in your AnalyticDB for PostgreSQL database and the credentials to access the OSS server. For more information, see CREATE USER MAPPING.
Syntax
CREATE USER MAPPING FOR { username | USER | CURRENT_USER | PUBLIC }
SERVER <server_name>
[ OPTIONS ( option 'value' [, ... ] ) ]
Parameters
|
Parameter |
Type |
Required |
Description |
|
username |
string |
Yes. You must specify one of the four options. |
The name of an existing user to map in the AnalyticDB for PostgreSQL instance. |
|
USER |
string |
Maps the current user of the AnalyticDB for PostgreSQL instance. |
|
|
CURRENT_USER |
string |
||
|
PUBLIC |
string |
Creates a public mapping that applies to all users of the AnalyticDB for PostgreSQL instance, including those created later. |
|
|
server_name |
string |
Yes |
The name of the OSS server. |
The following table describes the parameters for the OPTIONS clause.
|
Parameter |
Type |
Required |
Description |
|
id |
string |
Yes |
The AccessKey ID. To obtain one, see Create an AccessKey pair. |
|
key |
string |
Yes |
The AccessKey secret. To obtain one, see Create an AccessKey pair. |
When you import or export data across Alibaba Cloud accounts, you must configure the AccessKey ID and AccessKey Secret of the Alibaba Cloud account that owns the OSS bucket.
Example
CREATE USER MAPPING FOR PUBLIC
SERVER oss_serv
OPTIONS (
id 'LTAI****************',
key 'yourAccessKeySecret'
);
You can also use the DROP USER MAPPING statement to delete a user mapping. For more information, see DROP USER MAPPING.
Create an OSS foreign table
After you have an OSS server and a user that can access it, use the CREATE FOREIGN TABLE statement to create an OSS foreign table. For more information, see CREATE FOREIGN TABLE.
Syntax
CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name ( [
column_name data_type [ OPTIONS ( option 'value' [, ... ] ) ] [ COLLATE collation ] [ column_constraint [ ... ] ]
[, ... ]
] )
SERVER server_name
[ OPTIONS ( option 'value' [, ... ] ) ]
Parameters
|
Parameter |
Type |
Required |
Description |
|
table_name |
string |
Yes |
The name of the OSS foreign table. |
|
column_name |
string |
Yes |
The column name. |
|
data_type |
string |
Yes |
The data type of the column. |
The following table lists the parameters for the OPTIONS clause.
|
Parameter |
Type |
Required |
Description |
|
filepath |
string |
Yes. One of these three parameters is required. |
The full path and name of a single object in OSS. If you use this parameter, only the specified object is selected. |
|
prefix |
string |
Specifies a prefix for object paths. Only objects whose paths start with this prefix are selected. Regular expressions are not supported. Examples:
|
|
|
dir |
string |
The directory path in OSS must end with /, for example, test/mydir/. Selects all objects in the specified directory, excluding objects in its subdirectories. |
|
|
bucket |
string |
No |
The name of the bucket that contains the data objects. See Preparations to learn how to obtain the bucket name. Note
|
|
format |
string |
Yes |
The object format. Valid values:
|
|
filetype |
string |
No |
The compression type of the object. Valid values:
Note
|
|
log_errors |
boolean |
No |
Specifies whether to record errors in a log file. The default value is Note
This parameter applies only to objects in csv and text formats. |
|
segment_reject_limit |
numeric |
No |
The error threshold that causes the data loading task to abort. If the value includes a percent sign (
Note
This parameter applies only to objects in csv and text formats. |
|
header |
boolean |
No |
Specifies whether the source object contains a header row. Valid values:
Note
This parameter applies only to objects in csv format. |
|
delimiter |
string |
No |
The delimiter that separates fields. It must be a single-byte character.
Note
This parameter applies only to objects in csv and text formats. |
|
quote |
string |
No |
The quotation character for fields. It must be a single-byte character. The default is a double quotation mark ( Note
This parameter applies only to objects in csv format. |
|
escape |
string |
No |
The escape character that precedes a character that has a special meaning. It must be a single-byte character. The default is a double quotation mark ( Note
This parameter applies only to objects in csv format. |
|
null |
string |
No |
Specifies the string that represents a NULL value in the data object.
Note
This parameter applies only to objects in csv and text formats. |
|
encoding |
string |
No |
Specifies the encoding of the data objects. Defaults to the client encoding. Note
This parameter applies only to objects in csv and text formats. |
|
force_not_null |
boolean |
No |
Specifies whether field values can be empty strings. Valid values:
Note
This parameter applies only to objects in csv and text formats. |
|
force_null |
boolean |
No |
Specifies how to handle empty strings. Valid values:
Note
This parameter applies only to objects in csv and text formats. |
Example
CREATE FOREIGN TABLE ossexample (
date text,
time text,
open float,
high float,
low float,
volume int
) SERVER oss_serv OPTIONS (dir 'dir_oss_adb/', format 'csv');
After you create an OSS foreign table, you can use one of the following methods to check which OSS objects the table maps to:
-
Method 1:
EXPLAIN VERBOSE SELECT * FROM <OSS foreign table name>; -
Method 2:
SELECT * FROM get_oss_table_meta('<OSS foreign table name>');
You can also use the DROP FOREIGN TABLE statement to drop the OSS foreign table. For more information, see DROP FOREIGN TABLE.
Query and analyze OSS data
You can query an OSS foreign table the same way you query a local table. The following examples show common queries.
-
Query data using a key-value filter.
SELECT * FROM ossexample WHERE volume = 5; -
Query data using an aggregate function.
SELECT count(*) FROM ossexample WHERE volume = 5; -
Query data using the GROUP BY and LIMIT clauses.
SELECT low, sum(volume) FROM ossexample GROUP BY low ORDER BY low limit 5;
Join analysis with OSS foreign tables
-
Create a local table named example and insert test data.
CREATE TABLE example (id int, volume int); INSERT INTO example VALUES(1,1), (2,3), (4,5); -
Run a join query on the example and ossexample tables.
SELECT example.volume, min(high), max(low) FROM ossexample, example WHERE ossexample.volume = example.volume GROUP BY(example.volume) ORDER BY example.volume;
Fault tolerance
OSS FDW provides fault tolerance using the log_errors and segment_reject_limit parameters. This prevents errors in the raw data from interrupting scans of an OSS foreign table.
For more information about the log_errors and segment_reject_limit parameters, see Create an OSS foreign table.
-
Create a fault-tolerant OSS foreign table.
CREATE FOREIGN TABLE oss_error_sales (id int, value float8, x text) SERVER oss_serv OPTIONS (log_errors 'true', -- Enable error logging. segment_reject_limit '10', -- Set the reject limit to 10 rows. The scan is aborted if this limit is exceeded. dir 'error_sales/', -- Specify the OSS object directory for the foreign table. format 'csv', -- Specify the object format as CSV. encoding 'utf8'); -- Specify the encoding. -
Query the error log.
SELECT * FROM gp_read_error_log('oss_error_sales'); -
Clear the error log.
SELECT gp_truncate_error_log('oss_error_sales');
FAQ
Q: Does deleting data from an OSS foreign table also delete the underlying data in OSS?
A: No. Deleting data from an OSS foreign table does not delete the underlying data in OSS.