DataWorks provides Amazon Redshift Reader and Amazon Redshift Writer for you to read data from and write data to Amazon Redshift data sources. You can configure a synchronization task for an Amazon Redshift data source by using the codeless user interface (UI) or code editor. This topic describes the capabilities of synchronizing data from or to Amazon Redshift data sources.
Supported Amazon Redshift versions
The driver that is used by Amazon Redshift is redshift-jdbc4.2 Driver 2.1.0.1. For information about the capabilities of the driver, see Configure a JDBC driver version for Amazon Redshift.
Data type mappings
For information about the mappings between Amazon Redshift data types and SQL data types and the mappings between Amazon Redshift data types and Java data types, see the official Amazon Redshift documentation. The following table lists the data type mappings.
Amazon Redshift data type |
SQL data type |
Java data type |
BIGINT |
SQL_BIGINT |
LONG |
BOOLEAN |
SQL_BIT |
Boolean |
CHAR |
SQL_CHAR |
STRING |
DATE |
SQL_TYPE_DATE |
java.sql.Date |
DECIMAL |
SQL_NUMERIC |
BigDecimal |
DOUBLE PRECISION |
SQL_DOUBLE |
Double |
GEOMETRY |
SQL_ LONGVARBINARY |
byte[] |
INTEGER |
SQL_INTEGER |
INTEGER |
OID |
SQL_BIGINT |
LONG |
SUPER |
SQL_LONGVARCHAR |
STRING |
REAL |
SQL_REAL |
Float |
SMALLINT |
SQL_SMALLINT |
SHORT |
TEXT |
SQL_VARCHAR |
STRING |
TIME |
SQL_TYPE_TIME |
java.sql.Time |
TIMETZ |
SQL_TYPE_TIME |
java.sql.Time |
TIMESTAMP |
SQL_TYPE_ TIMESTAMP |
java.sql.Timestamp |
TIMESTAMPTZ |
SQL_TYPE_ TIMESTAMP |
java.sql.Timestamp |
VARCHAR |
SQL_VARCHAR |
STRING |
Preparations for data synchronization
Before you synchronize data in DataWorks, you must establish network connections between your data sources and a serverless resource group or an exclusive resource group for Data Integration to enable the resource group to access the data sources over an internal network. We recommend that you use a serverless resource group for data synchronization. For information about how to establish network connections, see Network connectivity solutions.
Create a data source
Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data source management. You can view parameter descriptions in the DataWorks console to understand the meanings of the parameters when you add a data source.
The following are the key parameters for an Amazon Redshift data source.
-
JDBC URL: The JDBC connection string, including the IP address, port number, database, and connection parameters. Both public and private IP addresses are supported. If you use a public IP address, ensure that the data integration resource group can access the host where your Amazon Redshift instance is located.
-
Username: The username for your Amazon Redshift database.
-
Password: The password for the specified username.
Develop a data synchronization task
For information about the entry point for and the procedure of configuring a synchronization task, see the following configuration guides.
Configure a batch synchronization task to synchronize data of a single table
For more information about the configuration procedure, see Configure a batch synchronization task by using the codeless UI and Configure a batch synchronization task by using the code editor.
-
For a complete list of parameters and a script example for the code editor, see Appendix: Script examples and parameter descriptions.
Appendix: Script examples and parameter descriptions
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Use the Code Editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.
Reader script example
{
"stepType": "redshift"
"parameter":
{
"datasource":"redshift_datasource",
"table": "redshift_table_name",
"where": "xxx=3",
"splitPk": "id",
"column":
[
"id",
"table_id",
"table_no",
"table_name",
"table_status"
]
},
"name": "Reader",
"category": "reader"
}
Reader parameters
Parameter |
Description |
Required |
Default value |
datasource |
The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor. |
Yes |
No default value |
table |
The name of the table from which you want to read data. |
Yes |
No default value |
column |
The list of columns to synchronize. Use commas to separate column names, for example, To synchronize all columns, use an asterisk (*), for example, |
Yes |
No default value |
where |
The WHERE clause. The batch synchronization task concatenates the values of the column, table, and where parameters to form an SQL statement and executes the SQL statement to read data from the source table. When you perform a test, you can set the where parameter to limit 10. To read the data that is generated on the current day, you can set the where parameter to gmt_create > $bizdate.
|
No |
No default value |
splitPk |
The column to use for data sharding. If you specify the |
No |
No default value |
Writer script example
{
"stepType": "redshift",// The plugin name.
"parameter":
{
"postSql":["delete from XXX;"],
"preSql":["delete from XXX;"],
"datasource":"redshift_datasource",// The data source name.
"table": "redshift_table_name",// The table name.
"writeMode": "insert",
"batchSize": 2048,
"column":
[
"id",
"table_id",
"table_no",
"table_name",
"table_status"
]
},
"name": "Writer",
"category": "writer"
}
Writer parameters
|
Parameter |
Description |
Required |
Default |
|
datasource |
The name of the data source, which must match the name you configured in DataWorks. |
Yes |
N/A |
|
table |
The name of the destination table. |
Yes |
N/A |
|
column |
The destination columns to write data to. Use commas to separate column names, for example, To write to all columns in order, use an asterisk (*), for example, |
Yes |
N/A |
|
preSql |
An SQL statement to execute before the data synchronization task starts. For example, you can use this to clear old data from the target table. You can specify one statement in the codeless UI or multiple statements in the code editor. |
No |
N/A |
|
postSql |
An SQL statement to execute after the data synchronization task finishes. For example, you can use this to add a timestamp. You can specify one statement in the codeless UI or multiple statements in the code editor. |
No |
N/A |
|
batchSize |
The number of records to write per batch. |
No |
2048 |
|
writeMode |
The write mode. Only |
No |
insert |