Amazon Redshift data source

更新时间:
复制 MD 格式

DataWorks provides Amazon Redshift Reader and Amazon Redshift Writer for you to read data from and write data to Amazon Redshift data sources. You can configure a synchronization task for an Amazon Redshift data source by using the codeless user interface (UI) or code editor. This topic describes the capabilities of synchronizing data from or to Amazon Redshift data sources.

Supported Amazon Redshift versions

The driver that is used by Amazon Redshift is redshift-jdbc4.2 Driver 2.1.0.1. For information about the capabilities of the driver, see Configure a JDBC driver version for Amazon Redshift.

Data type mappings

For information about the mappings between Amazon Redshift data types and SQL data types and the mappings between Amazon Redshift data types and Java data types, see the official Amazon Redshift documentation. The following table lists the data type mappings.

Amazon Redshift data type

SQL data type

Java data type

BIGINT

SQL_BIGINT

LONG

BOOLEAN

SQL_BIT

Boolean

CHAR

SQL_CHAR

STRING

DATE

SQL_TYPE_DATE

java.sql.Date

DECIMAL

SQL_NUMERIC

BigDecimal

DOUBLE PRECISION

SQL_DOUBLE

Double

GEOMETRY

SQL_ LONGVARBINARY

byte[]

INTEGER

SQL_INTEGER

INTEGER

OID

SQL_BIGINT

LONG

SUPER

SQL_LONGVARCHAR

STRING

REAL

SQL_REAL

Float

SMALLINT

SQL_SMALLINT

SHORT

TEXT

SQL_VARCHAR

STRING

TIME

SQL_TYPE_TIME

java.sql.Time

TIMETZ

SQL_TYPE_TIME

java.sql.Time

TIMESTAMP

SQL_TYPE_ TIMESTAMP

java.sql.Timestamp

TIMESTAMPTZ

SQL_TYPE_ TIMESTAMP

java.sql.Timestamp

VARCHAR

SQL_VARCHAR

STRING

Preparations for data synchronization

Before you synchronize data in DataWorks, you must establish network connections between your data sources and a serverless resource group or an exclusive resource group for Data Integration to enable the resource group to access the data sources over an internal network. We recommend that you use a serverless resource group for data synchronization. For information about how to establish network connections, see Network connectivity solutions.

Create a data source

Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data source management. You can view parameter descriptions in the DataWorks console to understand the meanings of the parameters when you add a data source.

The following are the key parameters for an Amazon Redshift data source.

  • JDBC URL: The JDBC connection string, including the IP address, port number, database, and connection parameters. Both public and private IP addresses are supported. If you use a public IP address, ensure that the data integration resource group can access the host where your Amazon Redshift instance is located.

  • Username: The username for your Amazon Redshift database.

  • Password: The password for the specified username.

Develop a data synchronization task

For information about the entry point for and the procedure of configuring a synchronization task, see the following configuration guides.

Configure a batch synchronization task to synchronize data of a single table

Appendix: Script examples and parameter descriptions

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Use the Code Editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Reader script example

{
  "stepType": "redshift"
  "parameter":
  {
    "datasource":"redshift_datasource",
    "table": "redshift_table_name",
    "where": "xxx=3",
    "splitPk": "id",
    "column":
    [
      "id",
      "table_id",
      "table_no",
      "table_name",
      "table_status"
    ]
  },
  "name": "Reader",
  "category": "reader"
}

Reader parameters

Parameter

Description

Required

Default value

datasource

The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor.

Yes

No default value

table

The name of the table from which you want to read data.

Yes

No default value

column

The list of columns to synchronize. Use commas to separate column names, for example,"column":["id","name","age"].

To synchronize all columns, use an asterisk (*), for example,"column":["*"].

Yes

No default value

where

The WHERE clause. The batch synchronization task concatenates the values of the column, table, and where parameters to form an SQL statement and executes the SQL statement to read data from the source table. When you perform a test, you can set the where parameter to limit 10.

To read the data that is generated on the current day, you can set the where parameter to gmt_create > $bizdate.

  • You can use the WHERE clause to read incremental data.

  • If the where parameter is not provided or is left empty, Amazon Redshift Reader reads all data from the source table.

No

No default value

splitPk

The column to use for data sharding. If you specify thesplitPk parameter, the system uses parallel threads to synchronize data, which improves efficiency.

No

No default value

Writer script example

{
  "stepType": "redshift",// The plugin name.
  "parameter":
  {
    "postSql":["delete from XXX;"],
    "preSql":["delete from XXX;"],
    "datasource":"redshift_datasource",// The data source name.
    "table": "redshift_table_name",// The table name.
    "writeMode": "insert",
    "batchSize": 2048,
    "column":
    [
      "id",
      "table_id",
      "table_no",
      "table_name",
      "table_status"
    ]
  },
  "name": "Writer",
  "category": "writer"
}

Writer parameters

Parameter

Description

Required

Default

datasource

The name of the data source, which must match the name you configured in DataWorks.

Yes

N/A

table

The name of the destination table.

Yes

N/A

column

The destination columns to write data to. Use commas to separate column names, for example,"column":["id","name","age"].

To write to all columns in order, use an asterisk (*), for example,"column":["*"].

Yes

N/A

preSql

An SQL statement to execute before the data synchronization task starts. For example, you can use this to clear old data from the target table. You can specify one statement in the codeless UI or multiple statements in the code editor.

No

N/A

postSql

An SQL statement to execute after the data synchronization task finishes. For example, you can use this to add a timestamp. You can specify one statement in the codeless UI or multiple statements in the code editor.

No

N/A

batchSize

The number of records to write per batch.

No

2048

writeMode

The write mode. Only insert is supported.

No

insert