Best practices for instant DDL to add columns in ApsaraDB RDS for PostgreSQL

更新时间:
复制 MD 格式

ALTER TABLE ... ADD COLUMN acquires an exclusive lock on the table. If another session holds a conflicting lock — from autovacuum or a long-running transaction — the ADD COLUMN statement waits. During that wait, every new query arriving on the same table is queued behind it, which can drive reads and writes to zero until the DDL completes or is canceled. This topic explains why this happens and provides two solutions that limit the impact on your workloads by configuring a lock wait timeout.

How lock contention occurs

ALTER TABLE ... ADD COLUMN requires an exclusive lock on the table, which conflicts with every other lock mode — including the locks held by concurrent reads and writes. The table below shows which common operations conflict with ALTER TABLE:

Runs concurrently with SELECT INSERT / UPDATE / DELETE VACUUM / ANALYZE ALTER TABLE / DROP TABLE
SELECT Yes Yes Yes No
INSERT / UPDATE / DELETE Yes Yes Yes No
ALTER TABLE / DROP TABLE No No No No

The two most common causes that prevent ALTER TABLE from acquiring the lock immediately are:

  • Autovacuum: Autovacuum can block DDL operations.

  • Long-running transactions: A transaction that started before the DDL and is still open holds the table lock until it commits or rolls back.

The lock queue cascade

When ALTER TABLE cannot immediately acquire the lock, it enters a wait queue. PostgreSQL then requires every subsequent query on that table to wait behind the ALTER TABLE — even queries that would normally run concurrently with each other. A DDL statement that is itself blocked therefore becomes a barrier that blocks all traffic behind it, causing a complete TPS drop until the lock is released.

Operations that trigger a full table rewrite

Most ADD COLUMN operations complete in seconds without rewriting the table:

  • No default value: Completes within seconds in all PostgreSQL versions.

  • Non-volatile default value (for example, a constant string or integer): Completes within seconds on PostgreSQL 11 and later. The default is stored in table metadata rather than written to every row. For details, see PostgreSQL documentation.

Solutions

Solution 1: Cancel the DDL operation on lock wait timeout

Set a transaction-level lock_timeout. If the DDL cannot acquire the exclusive lock within the specified period, it is automatically canceled rather than continuing to block subsequent queries.

BEGIN;

SET LOCAL lock_timeout = 500;  -- Cancel if the lock is not acquired within 500 ms.
ALTER TABLE <table_name> ADD COLUMN <column_name> VARCHAR;

COMMIT;
Important

The lock_timeout value directly affects the trade-off between DDL success rate and workload impact. A short timeout minimizes the blocking window but may require more retries. A long timeout reduces retries but blocks the table longer on each attempt. See Stress testing results for measured TPS impact at different timeout values.

Solution 2: Cancel and automatically retry until success (recommended)

Combine a lock_timeout with an automatic retry loop. The DDL retries every second until it succeeds, so each failed attempt blocks the table for at most lock_timeout milliseconds.

DO $$
DECLARE
    msg text;
BEGIN
    LOOP
        BEGIN
            PERFORM pg_sleep(1);              -- Wait 1 second before each attempt.
            SET LOCAL lock_timeout = 500;     -- Cancel if the lock is not acquired within 500 ms.
            ALTER TABLE <table_name> ADD COLUMN <column_name> VARCHAR;
            EXIT;                             -- Exit the loop on success.
        EXCEPTION WHEN OTHERS THEN
            GET STACKED DIAGNOSTICS msg = MESSAGE_TEXT;
            RAISE NOTICE 'failed, error: %s', msg;
        END;
    END LOOP;
    RAISE NOTICE 'success!';
END;
$$;

This approach is preferred over Solution 1 for production use: it eventually completes without manual intervention, and each failed attempt blocks the table for no longer than lock_timeout milliseconds.

Terminate the blocking connection (optional)

Warning

Terminating a connection disconnects it immediately. Use this approach only when the blocking session cannot be resolved by other means.

If a specific connection is blocking the DDL operation, identify and terminate it.

  1. Get the process ID (PID) of the current session.

    SELECT pg_backend_pid();
  2. Run the DDL operation.

    ALTER TABLE <table_name> ADD COLUMN <column_name> VARCHAR;
  3. Query the process that blocks the DDL operation.

    SELECT pg_blocking_pids(<The PID obtained in step 1>);
  4. Terminate the blocking process. Choose one of the following:

    • Cancel the current query without disconnecting the session:

      SELECT pg_cancel_backend(<pid_from_step_3>);
    • Forcibly disconnect the session:

      SELECT pg_terminate_backend(<pid_from_step_3>);
    Note

    Terminating an autovacuum process may fail. PostgreSQL protects autovacuum workers from forced termination in some configurations.

Choosing a lock_timeout value

lock_timeout value DDL behavior Workload impact
Not set Waits indefinitely until the lock is acquired or the statement times out TPS drops to 0 for the full wait duration
Short (for example, 100 ms) Canceled quickly if the lock is not available; may require many retries Minimal impact per attempt
Long (for example, 500 ms) Fewer retries needed but blocks the table longer on each attempt Noticeable impact per attempt

Start with 100–500 ms and adjust based on your observed retry frequency and TPS sensitivity. See Stress testing results for measured values.

Stress testing results

The following results show how lock_timeout affects TPS when lock contention occurs during ADD COLUMN. The test uses Sysbench (oltp_read_only) on a test table for 300 seconds, with long-running transactions running concurrently.

image
Stage Operation TPS impact
1 Sysbench oltp_read_only + long-running transactions Baseline
2 ADD COLUMN without lock_timeout, ~30 seconds Drops to 0
3 ADD COLUMN with lock_timeout = 100 ms (Solution 2), ~30 seconds Slightly affected
4 ADD COLUMN with lock_timeout = 500 ms (Solution 2), ~30 seconds Significantly affected
5 Terminate the long-running transaction (optional step) DDL completes immediately

Key observations:

  • Without lock_timeout, a blocked ADD COLUMN brings TPS to 0 for the entire duration of the block.

  • With lock_timeout = 100 ms, the impact is minimal because each failed attempt releases the lock quickly.

  • With lock_timeout = 500 ms, the impact is more noticeable because each attempt blocks the table for up to 500 ms.

  • Terminating the blocking transaction is the most direct way to unblock the DDL, but it disrupts the terminated session.

What's next