Synchronize vector data in a PostgreSQL database

更新时间:
复制 MD 格式

In artificial intelligence (AI) applications that use large language models (LLMs), retrieval-augmented generation (RAG) technology retrieves information from external knowledge bases. This technology significantly improves the accuracy and timeliness of the model's output. It solves problems in LLMs, such as slow knowledge updates and hallucinations, which are inaccurate answers. Compared to traditional keyword retrieval, vector retrieval provides semantic similarity search capabilities. It also supports the retrieval of unstructured and multimodal data. The pgvector extension for PostgreSQL adds vector search capabilities while retaining the original structured data management features of PostgreSQL. The data synchronization feature of Data Transmission Service (DTS) can quickly update vector data from a PostgreSQL database to other databases in different regions. This feature helps you implement cross-region disaster recovery, low-latency queries, real-time decision-making, and multi-region data analytics in data-intensive scenarios.

Prerequisites

Overview

  1. Create a database instance

    Create a destination ApsaraDB RDS for PostgreSQL instance.

  2. Create an account

    Create an account for data synchronization in the destination ApsaraDB RDS for PostgreSQL instance.

  3. Create a database

    In the destination ApsaraDB RDS for PostgreSQL instance, create a database to receive data.

  4. Install the extension

    In the destination ApsaraDB RDS for PostgreSQL instance, install the pgvector extension for the destination database to support writing vector data.

  5. Create a data synchronization instance

    Use DTS to perform data synchronization.

Preparations

Note

In this example, a self-managed PostgreSQL database that runs on a server of the Linux operating system is used.

  1. Log on to the server on which the self-managed PostgreSQL database resides.

  2. Run the following command to query the number of used replication slots in the self-managed PostgreSQL database:

    select count(1) from pg_replication_slots;
  3. Modify the postgresql.conf configuration file. Set the wal_level parameter to logical, and make sure that the values of the max_wal_senders and max_replication_slots parameters are greater than the sum of the number of used replication slots in the self-managed PostgreSQL database and the number of DTS instances whose source database is the self-managed PostgreSQL database.

    # - Settings -
    
    wal_level = logical			# minimal, replica, or logical
    					# (change requires restart)
    
    ......
    
    # - Sending Server(s) -
    
    # Set these on the master and on any standby that will send replication data.
    
    max_wal_senders = 10		# max number of walsender processes
    				# (change requires restart)
    #wal_keep_segments = 0		# in logfile segments, 16MB each; 0 disables
    #wal_sender_timeout = 60s	# in milliseconds; 0 disables
    
    max_replication_slots = 10	# max number of replication slots
    				# (change requires restart)
    Note

    After you modify the configuration file, restart the self-managed PostgreSQL database for the parameter settings to take effect.

  4. Add the CIDR blocks of DTS servers to the pg_hba.conf configuration file of the self-managed PostgreSQL database. Add only the CIDR blocks of the DTS servers that reside in the same region as the destination database. For more information, see DTS server IP whitelist.

    Note
    • After you modify the configuration file, execute the SELECTpg_reload_conf(); statement or restart the self-managed PostgreSQL database for the parameter to take effect.

    • For more information about the pg_hba.conf configuration file, see The pg_hba.conf File. Skip this step if the IP address in the pg_hba.conf file is set to 0.0.0.0/0. The following figure shows the configurations.

    IP

  5. Create a database and schema in the RDS instance based on the database and schema information of the objects to be synchronized. The schema names in the source and destination databases must be the same. For more information, see Create a database and Manage accounts by using schemas.

Step 1: Create a database instance

  1. Go to the RDS instance purchase page.

  2. Select the configuration parameters for the instance.

    Set Engine to PostgreSQL version 14, 15, or 16. You can select other parameters as needed. For more information, see Create an ApsaraDB RDS for PostgreSQL instance.

  3. Review the order information, quantity, and subscription duration (for subscription instances only). Click Confirm Order, and then complete the payment. A message is displayed in the console indicating that the instance is successfully created.

    Note

    For subscription instances, we recommend that you Enable Auto-renewal to prevent business interruptions caused by an expired subscription.

    Auto-renewal aligns with the subscription term (monthly or yearly) and can be disabled at any time. For details, see Renew an expired resource and Auto-renewal.

  4. View the instance.

    Go to the Instances page. In the top navigation bar, select the region where the instance is located. Find the instance that you created based on its Creation Time.

    Note

    It takes 1 to 10 minutes to create the instance. You may need to refresh the page to view the instance.

  5. View the minor engine version of the destination ApsaraDB RDS for PostgreSQL instance.

    1. After the instance is created, click the ID of the destination instance.

    2. On the Basic Information page of the destination instance, view the Minor Version Information in the Configuration Information section.

    3. Ensure that the minor engine version of the destination instance is 20230430 or later.

      Note

      If the minor engine version of the destination ApsaraDB RDS for PostgreSQL instance does not meet the requirement, you must upgrade the version. For more information, see Upgrade the minor engine version.

Step 2: Create an account

  1. Go to the Instances page. In the top navigation bar, select the region in which the RDS instance resides. Then, find the RDS instance and click the ID of the instance.

  2. In the left-side navigation pane, click Accounts.

  3. Click Create Account.

  4. Set the account parameters.

    Enter a Database Account. Set Account Type to Privileged Account. Set and confirm the account password in the New Password and Confirm Password fields. For more information, see Create an account.

  5. Click OK.

Step 3: Create a database

  1. Go to the Instances page. In the top navigation bar, select the region in which the RDS instance resides. Then, find the RDS instance and click the ID of the instance.

  2. In the left-side navigation pane, click Databases.

  3. Click Create Database.

  4. Set the database parameters.

    Enter a Database Name. For Authorized Account, select the account that you created in Step 2: Create an account. You can select other parameters as needed. For more information, see Create a database.

  5. Click Create.

Step 4: Install the extension

  1. Go to the Instances page. In the top navigation bar, select the region in which the RDS instance resides. Then, find the RDS instance and click the ID of the instance.

  2. In the left-side navigation pane, click Plug-ins.

  3. On the Plugin Marketplace tab, click AIGC, and then click Install on the vector card.

    image

  4. In the dialog box that appears, select a Database Name and a Database Account.

    Note

    For Database Name, select the database that you created in Step 3: Create a database. For Database Account, select the account that you created in Step 2: Create an account.

  5. Click Install.

    You can view the plugin installation status on the Plug-ins > Extension Management > Installed Extensions page.

    image

Step 5: Create a data synchronization instance

Note

This section uses the simple mode as an example to describe how to create a data synchronization instance. For more information, see Synchronize data from a self-managed PostgreSQL database to an ApsaraDB RDS for PostgreSQL instance.

  1. Go to the data synchronization task list page in the destination region. You can do this in one of two ways.

    DTS console

    1. Log on to the DTS console.

    2. In the navigation pane on the left, click Data Synchronization.

    3. In the upper-left corner of the page, select the region where the synchronization instance is located.

    DMS console

    Note

    The actual steps may vary depending on the mode and layout of the DMS console. For more information, see Simple mode console and Customize the layout and style of the DMS console.

    1. Log on to the DMS console.

    2. In the top menu bar, choose Data + AI > DTS (DTS) > Data Synchronization.

    3. To the right of Data Synchronization Tasks, select the region of the synchronization instance.

  2. Click Create Task to open the task configuration page.

  3. Optional: In the upper-right corner of the page, click New Configuration Page.

    Note
    • If you are already on the new configuration page (the button in the upper-right corner is Back to Previous Version), you can skip this step.

    • Some parameters differ between the new and old configuration pages. We recommend using the new version.

  4. Configure the source and destination databases.

    Category

    Configuration

    Description

    Source Database

    Database Type

    Select PostgreSQL.

    Access Method

    Select a connection type based on the deployment location of the source database. This example uses Express Connect, VPN Gateway, or Smart Access Gateway.

    Note

    If the source database is self-managed, perform the required preparations. For more information, see Preparations overview.

    Instance Region

    Select the region of the VPC where the self-managed PostgreSQL database is located.

    Replicate Data Across Alibaba Cloud Accounts

    For this example, select No, as the database instance belongs to the current Alibaba Cloud account.

    Connected VPC

    Select the VPC connected to the self-managed PostgreSQL database.

    Domain Name or IP

    Enter the IP address of the server for the self-managed PostgreSQL database.

    Port Number

    Enter the port used by the self-managed PostgreSQL database. The default port is 3433.

    Database Name

    Enter the name of the database that contains the objects to be synchronized in the self-managed PostgreSQL database.

    Database Account

    Enter an account with superuser permissions for the self-managed PostgreSQL database.

    Database Password

    Enter the password for the specified database account.

    Encryption

    Select a method as needed. This example uses the default Non-encrypted connection.

    Destination Database

    Database Type

    Select PostgreSQL.

    Access Method

    Select Alibaba Cloud Instance.

    Instance Region

    Select the region of the ApsaraDB RDS for PostgreSQL instance that you created in Step 1: Create a database instance.

    Instance ID

    Select the ID of the ApsaraDB RDS for PostgreSQL instance that you created in Step 1: Create a database instance.

    Database Name

    Enter the name of the database that you created in Step 3: Create a database.

    Database Account

    Enter the name of the account that you created in Step 2: Create an account.

    Database Password

    Enter the password for the specified database account.

    Encryption

    Select a method as needed. This example uses the default Non-encrypted connection.

  5. After completing the configuration, click Test Connectivity and Proceed at the bottom of the page.

    Note
    • Ensure that you add the CIDR blocks of the DTS servers (either automatically or manually) to the security settings of both the source and destination databases to allow access. For more information, see Add the IP address whitelist of DTS servers.

    • If the source or destination is a self-managed database (i.e., the Access Method is not Alibaba Cloud Instance), you must also click Test Connectivity in the CIDR Blocks of DTS Servers dialog box.

  6. Configure the task objects.

    1. On the Configure Objects page, specify the objects to synchronize.

      Configuration

      Description

      Synchronization Types

      Incremental Data Synchronization is selected by default. This example also selects Schema Synchronization and Full Data Synchronization.

      Synchronization Topology

      This example uses one-way synchronization. Select One-way Synchronization.

      Processing Mode of Conflicting Tables

      Keep the default value, Precheck and Report Errors.

      Source Objects

      In the Source Objects box, select the objects to be synchronized and click 向右 to move them to the Selected Objects box.

      Important

      If a table to be synchronized has a dependent sequence and no sequence with the same name exists in the destination schema, you must also select the sequence in the Source Objects box.

      Selected Objects

      No additional configuration is required for this example. Keep the default settings.

    2. Click Next: Advanced Settings.

      This example does not require any changes. You can keep the default configuration.

    3. Click Next: Data Validation to configure a data validation task.

      This example does not use the data validation feature. You can keep the default configuration.

  7. At the bottom of the page, click Next: Save Task Settings and Precheck.

  8. When the Success Rate reaches 100%, click Next: Purchase Instance.

  9. Purchase the instance.

    1. On the Purchase page, select the billing method and link specification for the data synchronization instance. The following table describes the parameters.

      This example does not require any changes. You can keep the default configuration.

    2. Select Data Transmission Service (Pay-As-You-Go) Terms of Service.

    3. Click Buy and Start, and then click OK in the OK dialog box.

      You can monitor the task progress on the data synchronization page.

Appendix

Notes

Type

Description

Source database limits

  • The tables to be synchronized must have primary keys or UNIQUE constraints, and the fields must be unique. Otherwise, duplicate data may exist in the destination database.

    Note

    If the destination table is not created by DTS (that is, Synchronization Types is not set to Schema Synchronization), you must ensure that the table has the same primary key or non-null UNIQUE constraint as the source table. Otherwise, duplicate data may occur in the destination database.

  • The name of the database to be synchronized cannot contain hyphens (-), such as dts-testdata.

  • If you synchronize data at the table level and need to edit objects, such as mapping column names, and the number of tables in a single synchronization task exceeds 5,000, split the tables into multiple tasks or configure a task to synchronize the entire database. Otherwise, a request error may be reported after you submit the task.

  • DTS does not synchronize temporary tables, internal triggers, or some functions (C language functions and internal functions for PROCEDURE and FUNCTION) from the source database. DTS synchronizes some custom data types (COMPOSITE, ENUM, or RANGE) and the following constraints: primary key, foreign key, unique, and CHECK.

  • Write-ahead log (WAL):

    • WAL must be enabled. Set the wal_level parameter to logical.

    • For an incremental synchronization task, DTS requires that the WAL logs in the source database are retained for more than 24 hours. For a task that performs both full and incremental synchronization, DTS requires that the WAL logs are retained for at least 7 days. You can change the log retention period to more than 24 hours after the initial full data synchronization is complete. If the task fails because DTS cannot obtain the required WAL logs, or in extreme cases, data inconsistency or data loss occurs, the issue is not covered by the DTS Service-Level Agreement (SLA) because the specified log retention period is shorter than required.

  • If a failover occurs on the self-managed PostgreSQL database, the synchronization fails.

  • Make sure that the values of the max_wal_senders and max_replication_slots parameters are greater than the sum of the number of replication slots in use and the number of DTS instances to be created with this self-managed PostgreSQL database as the source.

  • If the source database has long-running transactions and the instance includes an incremental synchronization task, the write-ahead logs (WALs) generated before the long-running transactions are committed cannot be cleared and may accumulate. This can cause the disk space of the source database to become insufficient.

  • When the source instance is Google Cloud Platform Cloud SQL for PostgreSQL, the Database Account for the source database must have the `cloudsqlsuperuser` permission. When you select synchronization objects, you must select objects that this account is authorized to manage, or grant the Owner permission for the objects to be synchronized to this account (for example, by running the GRANT <owner_of_the_object_to_be_synchronized> TO <source_database_account_used_by_the_task> command to allow this account to perform related operations as the object owner).

    Note

    An account with the cloudsqlsuperuser permission cannot manage data whose owner is another account with the cloudsqlsuperuser permission.

  • Due to the limits of logical subscription in the source database, if a single piece of data to be synchronized exceeds 256 MB after an incremental change, the synchronization instance may fail and cannot be recovered. You must reconfigure the synchronization instance.

  • Do not run DDL operations that change database or table schemas during schema synchronization or full synchronization. Otherwise, the synchronization task fails.

    Note

    During full synchronization, DTS queries the source database. This creates metadata locks that may block DDL operations on the source database.

  • If you perform a major engine version upgrade on the source database while the synchronization instance is running, the instance fails and cannot be recovered. You must reconfigure the synchronization instance.

Other limits

  • A single data synchronization task can synchronize only one database. To synchronize multiple databases, you must configure a data synchronization task for each database.

  • DTS does not synchronize TimescaleDB extension tables, tables with cross-schema inheritance, or tables that contain expression-based unique indexes.

  • If a table to be synchronized contains a SERIAL field, the source database automatically creates a Sequence for the field. Therefore, when you configure Source Objects, if you select Schema Synchronization for Synchronization Types, we recommend that you also select Sequence or synchronize the entire schema. Otherwise, the synchronization instance may fail.

  • In the following three scenarios, you must run the ALTER TABLE schema.table REPLICA IDENTITY FULL; command on the tables to be synchronized before you write data to them. This ensures data consistency. Do not perform table locking operations during the execution of this command. Otherwise, the tables may be locked. If you skip the related check items in the precheck, DTS automatically runs this command during the initialization of the instance.

    • When the instance runs for the first time.

    • When the synchronization granularity is schema, and a new table is created in the schema to be synchronized or a table to be synchronized is rebuilt using the RENAME command.

    • When you use the Modify Objects feature.

    Note
    • In the command, replace schema and table with the names of the schema and table to which the data to be synchronized belongs.

    • Perform this operation during off-peak hours.

  • DTS validates data content but does not validate metadata such as sequences. You must validate the metadata yourself.

  • After the business is switched to the destination, new sequences do not use the maximum sequence value from the source database as the initial value for incrementing. You must update the sequence value of the destination database before the business switchover. For more information, see Update the sequence value of the destination database.

  • DTS creates the following temporary tables in the source database to obtain the DDL statements of incremental data, the structure of incremental tables, and heartbeat information. During synchronization, do not delete these temporary tables. Otherwise, the DTS task becomes abnormal. The temporary tables are automatically deleted after the DTS instance is released.

    public.dts_pg_class, public.dts_pg_attribute, public.dts_pg_type, public.dts_pg_enum, public.dts_postgres_heartbeat, public.dts_ddl_command, public.dts_args_session, and public.aliyun_dts_instance.

  • To ensure the accuracy of the displayed synchronization latency, DTS adds a heartbeat table named dts_postgres_heartbeat to the source database.

  • During data synchronization, DTS creates a replication slot with the prefix dts_sync_ in the source database to replicate data. DTS uses this replication slot to obtain incremental logs from the source database within 15 minutes. When the data synchronization fails or the synchronization instance is released, DTS attempts to automatically clear this replication slot.

    Note
    • If you change the password of the database account used by the task or delete the DTS IP address whitelist from the source database during data synchronization, the replication slot cannot be automatically cleared. In this case, you must manually clear the replication slot in the source database to prevent it from accumulating and occupying disk space, which can make the source database unavailable.

    • If a failover occurs on the source database, you must log on to the secondary database to manually clear the replication slot.

    Amazon slot查询信息

  • Evaluate the performance of the source and destination databases before you synchronize data. Synchronize data during off-peak hours. Otherwise, initial full data synchronization consumes read and write resources on the source and destination databases, which may increase the database load.

  • Initial full data synchronization runs concurrent INSERT operations, which causes table fragmentation in the destination database. As a result, the tablespace of the destination instance is larger than that of the source instance after the initial full data synchronization is complete.

  • For table-level data synchronization, if no data other than the data from DTS is written to the destination database, you can use Data Management (DMS) to perform online DDL operations. For more information, see Perform schema changes without table locks.

  • During DTS synchronization, do not write data other than the data from DTS to the destination database. Otherwise, data inconsistency occurs between the source and destination databases. For example, if you use DMS to perform online DDL operations while other data is being written to the destination database, data may be lost in the destination database.

  • For a task that performs full or incremental synchronization, if the tables to be synchronized in the source database contain foreign keys, triggers, or event triggers, and the destination database account is a privileged account or has superuser permissions, DTS temporarily sets the session_replication_role parameter to replica at the session level during synchronization. If the destination database account does not have these permissions, you must manually set the session_replication_role parameter to replica in the destination database. During this period (when session_replication_role is set to replica), if cascade update or delete operations occur in the source database, data inconsistency may occur. After the DTS task is released, you can change the value of the session_replication_role parameter back to origin.

  • If a task fails, DTS support staff will attempt to restore it within eight hours. During restoration, they may restart the task or adjust its parameters.

    Note

    Only DTS task parameters are modified—not database parameters. Parameters that may be adjusted include those listed in Modify instance parameters.

  • When you synchronize partitioned tables, you must include both the parent table and its child tables as synchronization objects. Otherwise, data inconsistency may occur in the partitioned table.

    Note

    The parent table of a PostgreSQL partitioned table does not directly store data. All data is stored in the child tables. The synchronization task must include the parent table and all its child tables. Otherwise, data in the child tables may not be synchronized, leading to data inconsistency between the source and destination.

Billing

Synchronization type

Pricing

Schema synchronization and full data synchronization

Free of charge.

Incremental data synchronization

Charged. For more information, see Billing overview.

SQL operations supported by incremental synchronization

DML: INSERT, UPDATE, DELETE

DDL: Supported only for tasks created after May 12, 2023 (UTC+8).

Important

For tasks created before May 12, 2023 (UTC+8), you must create triggers and functions in the source database to capture DDL changes before configuring the task. See Use triggers and functions to implement incremental DDL migration for PostgreSQL databases.

Supported DDL statements (requires a privileged account and self-managed PostgreSQL minor version 20210228 or later):

  • CREATE TABLE, DROP TABLE

  • ALTER TABLE (including RENAME TABLE, ADD COLUMN, ADD COLUMN DEFAULT, ALTER COLUMN TYPE, DROP COLUMN, ADD CONSTRAINT, ADD CONSTRAINT CHECK, ALTER COLUMN DROP DEFAULT)

  • TRUNCATE TABLE (source PostgreSQL 11 or later)

  • CREATE INDEX ON TABLE

The following DDL statements and scenarios are not supported:

  • CASCADE or RESTRICT clauses in DDL statements

  • DDL from sessions that run SET session_replication_role = replica

  • DDL executed by calling a FUNCTION

  • Mixed commits containing both DML and DDL statements — the DDL is not synchronized

  • DDL for objects that are not selected for synchronization

  • DDL executed through the Server Programming Interface (SPI)

  • BIT type data during incremental data synchronization