Read and write RDS for PostgreSQL data by using Spark SQL

更新时间:
复制 MD 格式

AnalyticDB for MySQL supports Spark SQL jobs that read from and write to ApsaraDB RDS for PostgreSQL databases over JDBC. This tutorial walks through setting up the JDBC driver, creating a temporary view, and running read and write queries against an RDS for PostgreSQL table.

Prerequisites

Before you begin, ensure that you have:

Step 1: Prepare data in RDS for PostgreSQL

In your RDS for PostgreSQL instance, create a schema named demo and a table named tbl, then insert sample rows:

CREATE SCHEMA demo;

CREATE TABLE demo.tbl (
  id INTEGER,
  first_name VARCHAR(32),
  last_name  VARCHAR(32),
  age        INTEGER
);

INSERT INTO demo.tbl VALUES(1, 'a', 'b', 5);
INSERT INTO demo.tbl VALUES(2, 'c', 'd', 6);
INSERT INTO demo.tbl VALUES(3, 'e', 'f', 7);

Step 2: Download and upload the PostgreSQL JDBC driver

Spark requires the PostgreSQL JDBC driver on the classpath to open a JDBC connection at runtime. Upload the driver to Object Storage Service (OSS) so that Spark can load it when the job starts.

  1. Download the driver that matches your RDS for PostgreSQL version from the official website. This tutorial uses postgresql-42.7.7.jar.

  2. Upload the JAR file to OSS.

Step 3: Submit the Spark SQL job

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner, select a region. In the left-side navigation pane, click Clusters, then click the target cluster ID.

  2. In the left-side navigation pane, choose Job Development > SQL Development.

  3. In the SQLConsole window, select the Spark engine and the job-specific resource group.

  4. Enter the following statements. Replace the placeholders with your actual values.

    -- Load the PostgreSQL JDBC driver from OSS
    ADD jar oss://<OSS_BUCKET_NAME>/postgresql-42.7.7.jar;
    
    -- Enable elastic network interface (ENI) access so Spark can reach the RDS for PostgreSQL instance
    SET spark.adb.eni.enabled=true;
    SET spark.adb.eni.vswitchId=<VSWITCH_ID>;
    SET spark.adb.eni.securityGroupId=<SECURITY_GROUP_ID>;
    
    -- Create a temporary view mapped to the RDS for PostgreSQL table
    CREATE TEMPORARY VIEW table_tmp
    USING org.apache.spark.sql.jdbc
    OPTIONS (
      url      'jdbc:postgresql://<RDS_ENDPOINT>:5432/<DB_NAME>',
      driver   'org.postgresql.Driver',
      dbtable  'demo.tbl',
      user     '<DB_USER>',
      password '<DB_PASSWORD>'
    );
    
    -- Write: insert a new row into the RDS for PostgreSQL table through the view
    INSERT INTO table_tmp VALUES(4, 'e', 'f', 8);
    
    -- Read: query all rows from the RDS for PostgreSQL table through the view
    SELECT * FROM table_tmp;

    Replace each placeholder with the appropriate value:

    Placeholder Description Example
    <OSS_BUCKET_NAME> Name of the OSS bucket where you uploaded the driver my-bucket
    <VSWITCH_ID> ID of the vSwitch associated with the RDS for PostgreSQL instance. Find it on the instance's Database Connection page by hovering over the VPC. vsw-bp1sxxsodv28ey5dl****
    <SECURITY_GROUP_ID> ID of the security group associated with the RDS for PostgreSQL instance sg-bp19mr685pmg4ihc****
    <RDS_ENDPOINT> Connection endpoint of the RDS for PostgreSQL instance pgm-t4n37****.pgsql.singapore.rds.aliyuncs.com
    <DB_NAME> Name of the target database test
    <DB_USER> Database account of the RDS for PostgreSQL instance myuser
    <DB_PASSWORD> Password for the database account

    The following table describes the key parameters:

    Parameter Description Scope
    ADD jar OSS path of the PostgreSQL JDBC driver JAR Read/Write
    spark.adb.eni.enabled Enables ENI-based network access. Set to true. Read/Write
    spark.adb.eni.vswitchId vSwitch ID for the RDS for PostgreSQL instance Read/Write
    spark.adb.eni.securityGroupId Security group ID associated with the RDS for PostgreSQL instance Read/Write
    table_tmp The name of the temporary view. This tutorial uses table_tmp as an example. Read/Write
    USING org.apache.spark.sql.jdbc Fixed value. Specifies the JDBC data source. Read/Write
    url JDBC connection string. Format: jdbc:postgresql://<host>:5432/<db_name> Read/Write
    driver Fixed value: org.postgresql.Driver Read/Write
    dbtable Table to access. Format: schema_name.table_name Read/Write
    user Database account of the RDS for PostgreSQL instance Read/Write
    password Password for the database account Read/Write
  5. Click Execute.

  6. After the job completes, view the query results in Logs on the Applications tab of the Job Development > Spark JAR Development page. For details, see Spark Development Editor.