MySQL - Realtime Compute for Apache Flink (Flink) - Alibaba Cloud Help Center

This topic describes how to use the MySQL connector.

Background information

The MySQL connector supports all databases that are compatible with the MySQL protocol, such as ApsaraDB RDS for MySQL, PolarDB for MySQL, OceanBase (MySQL mode), and self-managed MySQL.

Important

When you use the MySQL connector to read data from OceanBase, ensure that binary logging (binlog) is enabled and correctly configured. For more information, see Binlog-related operations. This feature is in public preview. Use this feature with caution.

The MySQL connector supports the following.

Category	Details
Supported types	Source tables, dimension tables, sink tables, and data ingestion data sources
Runtime mode	Only streaming mode is supported.
Data format	Not applicable
Specific monitoring metrics	Source tables: currentFetchEventTimeLag: The interval from when data is generated to when it is pulled by the Source operator. This metric is valid only in the binary logging phase. In the snapshot phase, the value is always 0. currentEmitEventTimeLag: The interval from when data is generated to when it leaves the Source operator. This metric is valid only in the binary logging phase. In the snapshot phase, the value is always 0. sourceIdleTime: The duration for which the source table has not generated new data. Dimension tables and sink tables: None. Note: For more information about the metrics, see Metric description.
API types	DataStream, SQL, and data ingestion YAML
Supports updating or deleting data in sink tables	Yes

Features

A MySQL change data capture (CDC) source table, also known as a MySQL streaming source table, first reads the full historical data from the database. Then, it seamlessly switches to reading binary logs. This process ensures that no data is missed or duplicated. Even if a failure occurs, data is processed with exactly-once semantics. A MySQL CDC source table supports concurrent reading of full data. It uses an incremental snapshot algorithm to implement lock-free reading and resumable data transfer. For more information, see About MySQL CDC source tables.

Unified batch and stream processing that supports reading both full and incremental data, which eliminates the need to maintain two separate processes.
Concurrent reading of full data for horizontal performance scaling.
Seamless switching from full data reading to incremental data reading and automatic scale-in to save compute resources.
Resumable data transfer during the full data reading phase for improved stability.
Lock-free reading of full data, which does not affect online services.
Support for reading backup logs of ApsaraDB RDS for MySQL.
Parallel parsing of binary log files for lower read latency.

Prerequisites

Before you use a MySQL CDC source table, you must complete the prerequisite operations described in Configure MySQL.

ApsaraDB RDS for MySQL

Perform a network probe to ensure network connectivity to Realtime Compute for Apache Flink.
MySQL version: 5.6, 5.7, 8.0.x, or 8.4.
Binary logging must be enabled. This is enabled by default.
The binary log format must be ROW. This is the default format.
The `binlog_row_image` parameter must be set to FULL. This is the default setting.
Binary Log Transaction Compression must be disabled. This feature was introduced in MySQL 8.0.20 and is disabled by default.
A MySQL user has been created with the SELECT, SHOW DATABASES, REPLICATION SLAVE, and REPLICATION CLIENT permissions.
Create a MySQL database and table. For more information, see Create a database and an account for an ApsaraDB RDS for MySQL instance. Use a privileged account to create the MySQL database to prevent operation failures due to insufficient permissions.
Configure an IP address whitelist. For more information, see Configure an IP address whitelist for an ApsaraDB RDS for MySQL instance.

PolarDB for MySQL

Perform a network probe to ensure network connectivity to Realtime Compute for Apache Flink.
MySQL version: 5.6, 5.7, 8.0.x, or 8.4.
Binary logging must be enabled. This is disabled by default.
The binary log format must be ROW. This is the default format.
The `binlog_row_image` parameter must be set to FULL. This is the default setting.
Binary Log Transaction Compression must be disabled. This feature was introduced in MySQL 8.0.20 and is disabled by default.
You have created a MySQL user with the SELECT, SHOW DATABASES, REPLICATION SLAVE, and REPLICATION CLIENT permissions.
Create a MySQL database and table. For more information, see Create a database and an account for a PolarDB for MySQL cluster. Use a privileged account to create the MySQL database to prevent operation failures due to insufficient permissions.
Configure an IP address whitelist. For more information, see Configure an IP address whitelist for a PolarDB for MySQL cluster.

Self-managed MySQL

Perform a network probe to ensure network connectivity to Realtime Compute for Apache Flink.
MySQL version: 5.6, 5.7, 8.0.x, or 8.4.
Binary logging must be enabled. This is disabled by default.
The binary log format must be ROW. The default format is STATEMENT.
The `binlog_row_image` parameter must be set to FULL. This is the default setting.
Binary Log Transaction Compression must be disabled. This feature was introduced in MySQL 8.0.20 and is disabled by default.
Create a MySQL user and grant the SELECT, SHOW DATABASES, REPLICATION SLAVE, and REPLICATION CLIENT permissions.
Create a MySQL database and table. For more information, see Create a database and an account for a self-managed MySQL instance. Use a privileged account to create the MySQL database to prevent operation failures due to insufficient permissions.
Configure an IP address whitelist. For more information, see Configure an IP address whitelist for a self-managed MySQL instance.
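The binary log requirements listed above can be checked from a MySQL client before the job is started. The following is a minimal sketch for a self-managed instance. The account name `flink_cdc` and the host pattern `'%'` are illustrative assumptions, not values required by the connector.

```sql
-- Check the binary log settings that the MySQL CDC source table depends on.
SHOW VARIABLES LIKE 'log_bin';                         -- required: ON
SHOW VARIABLES LIKE 'binlog_format';                   -- required: ROW
SHOW VARIABLES LIKE 'binlog_row_image';                -- required: FULL
SHOW VARIABLES LIKE 'binlog_transaction_compression';  -- required: OFF (the variable exists in MySQL 8.0.20 and later)

-- Create an account with the permissions listed above.
-- 'flink_cdc' and '%' are placeholder values for illustration only.
CREATE USER 'flink_cdc'@'%' IDENTIFIED BY '<yourPassword>';
GRANT SELECT, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT
  ON *.* TO 'flink_cdc'@'%';
```

On managed services such as ApsaraDB RDS for MySQL and PolarDB for MySQL, accounts and parameters are usually managed in the console, so only the SHOW VARIABLES checks are likely to apply there.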

Limits

General limits

MySQL CDC source tables do not support watermark definitions. To perform window aggregation, you can use a non-window aggregation method. For more information, see How do I perform window aggregation if watermark definitions are not supported?.
In Create Table As Select (CTAS) and Create Database As Select (CDAS) jobs, MySQL CDC source tables can synchronize some schema changes. For more information about the supported change types, see Schema evolution synchronization policies.
The MySQL CDC connector does not support the Binary Log Transaction Compression feature. Therefore, when you use the MySQL CDC connector to consume incremental data, ensure that Binary Log Transaction Compression is disabled. Otherwise, the connector may fail to retrieve incremental data.
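As an illustration of the non-window aggregation mentioned in the first limit, one possible approach is to truncate a time column to the desired window size and group by the result. The sketch below assumes the `mysqlcdc_source` table from the Syntax section of this topic and a one-minute window. It is not necessarily the method described in the linked FAQ.

```sql
-- Per-minute order count and total price without a watermark.
-- DATE_FORMAT truncates order_date to the minute, and that value acts as the window key.
SELECT
  DATE_FORMAT(order_date, 'yyyy-MM-dd HH:mm') AS window_start,
  COUNT(*) AS order_cnt,
  SUM(price) AS total_price
FROM mysqlcdc_source
GROUP BY DATE_FORMAT(order_date, 'yyyy-MM-dd HH:mm');
```

Because this is a regular group aggregation, the result is an updating stream, so the downstream sink must be able to accept updates.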

ApsaraDB RDS for MySQL limits

For ApsaraDB RDS for MySQL, do not read data from a secondary database or read-only replica. This is because the default binary log retention period for secondary databases and read-only replicas is short. If binary logs expire and are cleared, the job may fail to consume the binary log data and report an error.
ApsaraDB RDS for MySQL enables parallel primary/secondary synchronization by default but does not guarantee a consistent transaction order between the primary and secondary instances. This may cause data to be missed during a primary/secondary switchover and checkpoint recovery. To avoid this issue, you can manually enable the `slave_preserve_commit_order` option for ApsaraDB RDS for MySQL.

PolarDB for MySQL limits

MySQL CDC source tables do not support reading data from Multi-master Cluster Architecture clusters of PolarDB for MySQL V1.0.19 and earlier. For more information, see What is a Multi-master Cluster?. The binary logs generated by these clusters may contain duplicate table IDs. This can cause schema mapping errors in the CDC source table, which leads to errors when parsing binary log data.

Open source MySQL limits

By default, MySQL maintains the transaction order during primary/secondary binary log replication. If a MySQL replica has parallel replication enabled (`slave_parallel_workers` > 1) but does not set `slave_preserve_commit_order` to ON, its transaction commit order may differ from that of the primary database. When Flink CDC recovers from a checkpoint, it may then miss data because of the reordered transactions. To avoid this issue, set `slave_preserve_commit_order` = ON on the MySQL replica. Alternatively, set `slave_parallel_workers` = 1, but this sacrifices replication performance.
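The replica settings named above can be inspected and changed from a MySQL client. The following is a sketch under the assumption that the replica still accepts the `slave_*` variable names and the `STOP SLAVE` / `START SLAVE` statements. Newer MySQL releases rename these to `replica_parallel_workers`, `replica_preserve_commit_order`, `STOP REPLICA`, and `START REPLICA`, so use whichever names your version accepts.

```sql
-- Run on the MySQL replica that the CDC source connects to.
SHOW VARIABLES LIKE 'slave_parallel_workers';
SHOW VARIABLES LIKE 'slave_preserve_commit_order';

-- The replication SQL (applier) thread must be stopped before the value is changed.
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_preserve_commit_order = ON;
START SLAVE SQL_THREAD;
```

Depending on the MySQL version, additional replica settings may be required before this variable can be enabled. Check the MySQL reference manual for your release.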

Usage notes

Source table
- Each MySQL CDC data source requires a unique server ID.
  Purpose of Server ID
  
  Each MySQL CDC data source requires a unique server ID. If multiple MySQL CDC data sources that cannot be reused share the same server ID, the binary log offsets can become disordered. This can lead to data being read more than once or being missed.
  
  Server ID configuration for different scenarios
  
  You can specify the server ID in the Data Definition Language (DDL) statement. However, we recommend that you configure the server ID using dynamic hints instead of DDL parameters.
  - Degree of parallelism = 1 or incremental snapshot is disabled
    
    -- If the incremental snapshot framework is disabled or the degree of parallelism is 1, you can specify a single server ID.
    SELECT * FROM source_table /*+ OPTIONS('server-id'='123456') */;
  - Degree of parallelism > 1 and incremental snapshot is enabled
    
    -- You must specify a server ID range. The number of server IDs in the range must be greater than or equal to the degree of parallelism.
    -- This example assumes that the degree of parallelism is 3.
    SELECT * FROM source_table /*+ OPTIONS('server-id'='123456-123458') */;
  - CTAS for data synchronization
    
    When you use CTAS for data synchronization, if the CDC data sources have the same configuration, the data sources are automatically reused. In this case, you can configure the same server ID for multiple CDC data sources. For more information, see Example 4: Multiple CTAS statements.
  - Multiple non-CTAS source tables that cannot be reused
    
    If a job contains multiple MySQL CDC source tables and does not use CTAS statements for synchronization, the data sources cannot be reused. You must provide a different server ID for each CDC source table. Similarly, if the incremental snapshot framework is enabled and the degree of parallelism is greater than 1, you must specify a server ID range.
    
    SELECT *
    FROM source_table1 /*+ OPTIONS('server-id'='123456-123457') */
    LEFT JOIN source_table2 /*+ OPTIONS('server-id'='123458-123459') */
    ON source_table1.id = source_table2.id;
- During the full data reading phase, you cannot take a savepoint, add a table to or remove a table from the source table, and then restart the job from that savepoint. If you do so, the job fails to read data.
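For completeness, the server ID can also be set in the DDL statement, as mentioned in the server ID notes above, although dynamic hints remain the recommended approach. The following is a minimal sketch that assumes a degree of parallelism of 3. The table name, the columns `id` and `name`, and the ID range are illustrative.

```sql
CREATE TEMPORARY TABLE source_table (
  id INT,
  name STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql',
  'hostname' = '<yourHostname>',
  'port' = '3306',
  'username' = '<yourUsername>',
  'password' = '<yourPassword>',
  'database-name' = '<yourDatabaseName>',
  'table-name' = '<yourTableName>',
  -- A range of three IDs, one for each parallel reader.
  'server-id' = '123456-123458'
);
```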

Sink table
- Auto-increment primary keys: Do not declare auto-increment primary keys in the DDL. MySQL automatically populates them when writing data.
- You must declare at least one non-primary key field. Otherwise, an error is reported.
- The `NOT ENFORCED` constraint in the DDL indicates that Flink does not enforce primary key validation. You are responsible for ensuring the correctness and integrity of the primary key. For more information, see Validity Check.

Dimension table

If you want to use an index to accelerate queries, the order of fields in the JOIN condition must match the order of columns defined in the index, in line with the leftmost prefix rule. For example, if the index is (a, b, c), write the JOIN condition as ON t.a = x AND t.b = y.

The SQL generated by Flink may be rewritten by the optimizer. This can prevent the index from being hit during the actual database query. To confirm whether the index is used, check the execution plan (EXPLAIN) or the slow query log in MySQL to view the actual SELECT statement that is executed.
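The index guidance above can be sketched as follows. All table and column names are hypothetical, and the Flink query assumes that the `orders` table defines a processing-time attribute named `proc_time`. The EXPLAIN statement uses literal values as a stand-in for the statement that Flink actually issues.

```sql
-- MySQL side: a composite index on the dimension table.
CREATE INDEX idx_a_b_c ON dim_table (a, b, c);

-- Flink side: list the join keys in index order (a, then b).
SELECT o.order_id, d.c
FROM orders AS o
JOIN dim_table FOR SYSTEM_TIME AS OF o.proc_time AS d
  ON d.a = o.a AND d.b = o.b;

-- MySQL side: check whether an equivalent point query uses the index.
EXPLAIN SELECT a, b, c FROM dim_table WHERE a = 1 AND b = 2;
```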

SQL

You can use the MySQL connector in SQL jobs as a source table, dimension table, or sink table.

Syntax

CREATE TEMPORARY TABLE mysqlcdc_source (
   order_id INT,
   order_date TIMESTAMP(0),
   customer_name STRING,
   price DECIMAL(10, 5),
   product_id INT,
   order_status BOOLEAN,
   PRIMARY KEY(order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql',
  'hostname' = '<yourHostname>',
  'port' = '3306',
  'username' = '<yourUsername>',
  'password' = '<yourPassword>',
  'database-name' = '<yourDatabaseName>',
  'table-name' = '<yourTableName>'
);
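The statement above declares a source table. A sink table is declared with the same connector. The following sketch assumes a MySQL table whose primary key is `order_id`. The sink table name `mysql_sink` and the chosen columns are illustrative.

```sql
CREATE TEMPORARY TABLE mysql_sink (
  order_id INT,
  customer_name STRING,   -- at least one non-primary key field is declared
  price DECIMAL(10, 5),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql',  -- sink tables must use 'mysql', not 'mysql-cdc'
  'hostname' = '<yourHostname>',
  'port' = '3306',
  'username' = '<yourUsername>',
  'password' = '<yourPassword>',
  'database-name' = '<yourDatabaseName>',
  'table-name' = '<yourTableName>'
);

-- Write the change stream of the source table to the sink table.
INSERT INTO mysql_sink
SELECT order_id, customer_name, price
FROM mysqlcdc_source;
```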

Note

When writing to a sink table, the connector constructs and executes an SQL statement for each received data record. The statement is structured as follows:
- For a sink table without a primary key, an INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...); statement is executed.
- For a sink table with a primary key, an INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...) ON DUPLICATE KEY UPDATE column1 = VALUES(column1), column2 = VALUES(column2), ...; statement is executed. Note: If the physical table has a unique index constraint other than the primary key, inserting two records with different primary keys but the same unique index value causes a unique index conflict. This results in data being overwritten and lost.
If an auto-increment primary key is defined in the MySQL database, do not declare the auto-increment field in the Flink DDL. The database automatically populates this field when writing data. The connector supports writing and deleting data with auto-increment fields, but does not support updating this data.
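The unique index conflict described in the note can be reproduced directly in MySQL. The following is a sketch with a hypothetical table `user_sink`. The statements follow the general form given in the note and are not captured from the connector.

```sql
-- MySQL table with a primary key and an additional unique index.
CREATE TABLE user_sink (
  id INT PRIMARY KEY,
  email VARCHAR(64),
  name VARCHAR(64),
  UNIQUE KEY uk_email (email)
);

-- First record: inserted as a new row.
INSERT INTO user_sink (id, email, name) VALUES (1, 'a@example.com', 'Alice')
ON DUPLICATE KEY UPDATE id = VALUES(id), email = VALUES(email), name = VALUES(name);

-- Second record: different primary key, same email.
-- The unique index conflict turns this statement into an UPDATE of the existing row.
INSERT INTO user_sink (id, email, name) VALUES (2, 'a@example.com', 'Bob')
ON DUPLICATE KEY UPDATE id = VALUES(id), email = VALUES(email), name = VALUES(name);

-- Only one row remains. The data written for the first record has been overwritten.
SELECT * FROM user_sink;
```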

WITH parameters

General

Parameter	Description	Required	Data type	Default value	Notes
connector	The table type.	Yes	STRING	None	When used as a source table, you can set this parameter to `mysql-cdc` or `mysql`. They are equivalent. When used as a dimension table or sink table, the value must be `mysql`.
hostname	The IP address or hostname of the MySQL database.	Yes	STRING	None	We recommend that you specify a virtual private cloud (VPC) address. Note If the MySQL database and Realtime Compute for Apache Flink are not in the same VPC, you must establish a cross-VPC network connection or use a public endpoint to access the database. For more information, see Manage and operate workspaces and How can a fully managed Flink cluster access the Internet?.
username	The username for the MySQL database service.	Yes	STRING	None	None.
password	The password for the MySQL database service.	Yes	STRING	None	None.
database-name	The name of the MySQL database.	Yes	STRING	None	When a database is used as a source table, you can use a regular expression for the database name to read data from multiple databases. When you use regular expressions, do not use the ^ and $ symbols to match the beginning and end of the string. For more information, see the notes for the table-name parameter.
table-name	The name of the MySQL table.	Yes	STRING	None	You can use a regular expression for the source table name to read data from multiple tables. When you read data from multiple MySQL tables, submit multiple CTAS statements as a single job. This avoids enabling multiple binary log listeners and improves performance and efficiency. For more information, see Multiple CTAS statements: Submit as a single job. When you use regular expressions, do not use the ^ and $ symbols to match the beginning and end of the string. For more information, see the following note. Note When a MySQL CDC source table matches table names using a regular expression, it concatenates the database-name and table-name that you specify with the string \\. to form a full-path regular expression. Before VVR 8.0.1, the character . was used instead. The connector then uses this regular expression to match the fully qualified names of tables in the MySQL database. For example, if you set 'database-name'='db_.*' and 'table-name'='tb_.+', the connector uses the regular expression db_.*\\.tb_.+ to match the fully qualified table names and determine which tables to read. Before VVR 8.0.1, the regular expression was db_.*.tb_.+.
port	The port number of the MySQL database service.	No	INTEGER	3306	None.
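As an illustration of the regular expression support described for database-name and table-name, the following sketch reads sharded tables through one source table. The table name `sharded_orders` and its columns are hypothetical, and the example assumes that all matched tables share the same schema.

```sql
CREATE TEMPORARY TABLE sharded_orders (
  order_id INT,
  customer_name STRING,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql',
  'hostname' = '<yourHostname>',
  'port' = '3306',
  'username' = '<yourUsername>',
  'password' = '<yourPassword>',
  -- Matches databases such as db_1 and db_2. Do not add ^ or $.
  'database-name' = 'db_.*',
  -- Matches tables such as tb_1 and tb_orders. Do not add ^ or $.
  'table-name' = 'tb_.+'
);
```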

Source table only

Parameter	Description	Required	Data type	Default value	Notes
server-id	A numeric ID for the database client.	No	STRING	A random value between 5400 and 6400 is generated.	This ID must be globally unique within the MySQL cluster. Set a different ID for each job that connects to the same database. This parameter also supports an ID range format, such as 5400-5408. When incremental reading is enabled, concurrent reading is supported. In this case, set an ID range so that each concurrent reader uses a different ID. For more information, see Use Server ID.
scan.incremental.snapshot.enabled	Specifies whether to enable incremental snapshots.	No	BOOLEAN	true	Incremental snapshots are enabled by default. Incremental snapshot is a new mechanism for reading full data snapshots. Compared to the old snapshot reading method, incremental snapshots have many advantages, including the following: The source can read full data in parallel. The source supports chunk-level checkpoints when reading full data. The source does not need to acquire a global read lock (FLUSH TABLES WITH READ LOCK) when reading full data. If you want the source to support concurrent reading, each concurrent reader needs a unique server ID. Therefore, server-id must be a range, such as 5400-6400, and the size of the range must be greater than or equal to the degree of parallelism. Note This configuration item is removed in Ververica Runtime (VVR) 11.1 and later.
scan.incremental.snapshot.chunk.size	The size of each chunk in number of rows.	No	INTEGER	8096	When incremental snapshot reading is enabled, the table is split into multiple chunks for reading. The data of a chunk is cached in memory before it is fully read. The fewer rows each chunk contains, the greater the total number of chunks in the table. Although this makes fault recovery more fine-grained, it may lead to out-of-memory (OOM) errors and lower overall throughput. Therefore, set a chunk size that balances these factors.
scan.snapshot.fetch.size	The maximum number of records to pull at a time when reading the full data of a table.	No	INTEGER	1024	None.
scan.startup.mode	The startup mode for data consumption.	No	STRING	initial	Valid values: initial (default): On the first startup, the connector scans the full historical data and then reads the latest binary log data. latest-offset: On the first startup, the connector does not scan the historical data. It starts reading from the end of the binary log, which means it only reads the latest changes made after the connector starts. earliest-offset: The connector does not scan the historical data. It starts reading from the earliest available binary log. specific-offset: The connector does not scan the historical data. It starts from a specific binary log offset. You can specify the offset by configuring both scan.startup.specific-offset.file and scan.startup.specific-offset.pos, or by configuring only scan.startup.specific-offset.gtid-set to start from a specific GTID set. timestamp: The connector does not scan the historical data. It starts reading the binary log from a specified timestamp. The timestamp is specified by scan.startup.timestamp-millis in milliseconds. Important When using the earliest-offset, specific-offset, or timestamp startup mode, ensure that the schema of the corresponding table does not change between the specified binary log consumption position and the job startup time. This prevents errors caused by schema mismatches.
scan.startup.specific-offset.file	The binary log filename for the start offset when using the specific-offset startup mode.	No	STRING	None	When you use this parameter, you must set scan.startup.mode to specific-offset. Example filename format: `mysql-bin.000003`.
scan.startup.specific-offset.pos	The offset within the specified binary log file for the start offset when using the specific-offset startup mode.	No	INTEGER	None	When you use this parameter, you must set scan.startup.mode to specific-offset.
scan.startup.specific-offset.gtid-set	The GTID set for the start offset when using the specific-offset startup mode.	No	STRING	None	When you use this parameter, you must set scan.startup.mode to specific-offset. Example GTID set format: `24DA167-0C0C-11E8-8442-00059A3C7B00:1-19`.
scan.startup.timestamp-millis	The timestamp in milliseconds for the start offset when using the timestamp startup mode.	No	LONG	None	When you use this parameter, you must set scan.startup.mode to timestamp. The timestamp unit is milliseconds. Important When you specify a time, MySQL CDC attempts to read the initial event of each binary log file to determine its timestamp. It then locates the binary log file corresponding to the specified time. Make sure that the binary log file corresponding to the specified timestamp has not been cleared from the database and can be read.
server-time-zone	The session time zone used by the database.	No	STRING	If you do not specify this parameter, the system uses the environment time zone of the Flink job runtime as the database server time zone. This is the time zone of the zone you selected.	Example: Asia/Shanghai. This parameter controls how the TIMESTAMP type in MySQL is converted to the STRING type. For more information, see Debezium temporal values.
debezium.min.row.count.to.stream.results	When the number of rows in a table is greater than this value, batch reading mode is used.	No	INTEGER	1000	Flink reads data from a MySQL source table in one of the following ways: Full read: Reads the entire table's data directly into memory. This method is fast but consumes a corresponding amount of memory. If the source table is very large, there is a risk of OOM errors. Batch read: Reads data in multiple batches, with a certain number of rows per batch, until all data is read. This method avoids OOM risks when reading large tables but is relatively slow.
connect.timeout	The maximum time that the connector waits for a connection attempt to the MySQL database server before the attempt times out and is retried.	No	DURATION	30s	None.
connect.max-retries	The maximum number of retries after a failed connection to the MySQL database service.	No	INTEGER	3	None.
connection.pool.size	The size of the database connection pool.	No	INTEGER	20	The database connection pool is used to reuse connections, which can reduce the number of database connections.
jdbc.properties.*	Custom connection parameters in the JDBC URL.	No	STRING	None	You can pass custom connection parameters. For example, to not use the SSL protocol, you can configure 'jdbc.properties.useSSL' = 'false'. For more information about the supported connection parameters, see MySQL Configuration Properties.
debezium.*	Custom parameters for Debezium to read binary logs.	No	STRING	None	You can pass custom Debezium parameters. For example, use 'debezium.event.deserialization.failure.handling.mode'='ignore' to specify the handling logic for parsing errors. Warning Do not modify Debezium parameters arbitrarily. This may cause the connector to read data incorrectly. For example, the debezium.binlog.buffer.size parameter must not be configured.
heartbeat.interval	The interval at which the source advances the binary log offset using heartbeat events.	No	DURATION	30s	Heartbeat events are used to advance the binary log offset in the source. This is very useful for tables in MySQL that are updated infrequently. For such tables, the binary log offset cannot advance automatically. Heartbeat events can push the binary log offset forward, which prevents issues caused by an expired binary log offset. An expired binary log offset can cause the job to fail and be unrecoverable, requiring a stateless restart.
scan.incremental.snapshot.chunk.key-column	Specifies a column to be used as the splitting column for sharding during the snapshot phase.	See the Notes column.	STRING	None	Required for tables without a primary key. The selected column must be of a non-null type (NOT NULL). Optional for tables with a primary key. Only one column can be selected from the primary key.
rds.region-id	The region ID of the Alibaba Cloud ApsaraDB RDS for MySQL instance.	Required when using the feature to read archived logs from OSS.	STRING	None	For more information about region IDs, see Regions and zones. Important Because the GTID string for MySQL CDC is randomly generated and not monotonically increasing like binary log file offsets, locating a GTID in a file requires downloading and parsing all archived logs from OSS. This process is very resource-intensive and time-consuming, making features that rely on GTID offsets infeasible. Therefore, the OSS archived log feature only supports starting from a specified timestamp or a specified binary log file offset. It does not support starting from a specified GTID, nor does it support scenarios with primary/secondary switchovers in the archived logs, because MySQL primary/secondary switchovers rely on GTIDs. Evaluate this feature carefully before use.
rds.access-key-id	The AccessKey ID of the Alibaba Cloud ApsaraDB RDS for MySQL account.	Required when using the feature to read archived logs from OSS.	STRING	None	For more information, see How do I view the AccessKey ID and AccessKey secret?. Important To prevent your AccessKey information from being leaked, use the secret management feature to specify the AccessKey ID. For more information, see Manage variables.
rds.access-key-secret	The AccessKey secret of the Alibaba Cloud ApsaraDB RDS for MySQL account.	Required when using the feature to read archived logs from OSS.	STRING	None	For more information, see How do I view the AccessKey ID and AccessKey secret? Important To prevent your AccessKey information from being leaked, use the secret management feature to specify the AccessKey secret. For more information, see Manage variables.
rds.db-instance-id	The ID of the Alibaba Cloud ApsaraDB RDS for MySQL instance.	Required when using the feature to read archived logs from OSS.	STRING	None	None.
rds.main-db-id	The primary database number of the Alibaba Cloud ApsaraDB RDS for MySQL instance.	No	STRING	None	For more information about how to obtain the primary database number, see ApsaraDB RDS for MySQL log backup. Supported only in VVR 8.0.7 and later. Note If this parameter is not specified, VVR 11.7 and later automatically query the primary database number based on the ApsaraDB RDS for MySQL connection information.
rds.download.timeout	The timeout period for downloading a single archived log from OSS.	No	DURATION	60s	None.
rds.endpoint	The service endpoint for obtaining OSS binary log information.	No	STRING	None	For more information about the valid values, see Endpoints. Supported only in VVR 8.0.8 and later.
scan.incremental.close-idle-reader.enabled	Specifies whether to close idle readers after the snapshot is complete.	No	BOOLEAN	false	Supported only in VVR 8.0.1 and later. For this configuration to take effect, you must set execution.checkpointing.checkpoints-after-tasks-finish.enabled to true.
scan.read-changelog-as-append-only.enabled	Specifies whether to convert the changelog data stream to an append-only data stream.	No	BOOLEAN	false	Valid values: true: All types of messages, including INSERT, DELETE, UPDATE_BEFORE, and UPDATE_AFTER, are converted to INSERT messages. Enable this option only in special scenarios, such as when you need to save delete messages from the upstream table. false (default): All types of messages are sent downstream as they are. Note Supported only in VVR 8.0.8 and later.
scan.only.deserialize.captured.tables.changelog.enabled	In the incremental phase, specifies whether to deserialize only the change events of the specified tables.	No	BOOLEAN	The default value is false in VVR 8.x versions. The default value is true in VVR 11.1 and later.	Valid values: true (default in VVR 11.1 and later): Deserializes only the change data of the target tables to accelerate binary log reading. false (default in VVR 8.x): Deserializes the change data of all tables. Note Supported only in VVR 8.0.7 and later. When you use VVR 8.0.8 or earlier, you must change the parameter name to debezium.scan.only.deserialize.captured.tables.changelog.enable.
scan.parse.online.schema.changes.enabled	In the incremental phase, specifies whether to attempt to parse RDS lockless change DDL events.	No	BOOLEAN	false	Valid values: true: Parses RDS lockless change DDL events. false (default): Does not parse RDS lockless change DDL events. This is an experimental feature. Before performing an online lockless change, take a snapshot of the Flink job for recovery. Note Supported only in VVR 11.1 and later.
scan.incremental.snapshot.backfill.skip	Specifies whether to skip backfill during the snapshot reading phase.	No	BOOLEAN	false	Valid values: true: Skips backfill during the snapshot reading phase. false (default): Does not skip backfill during the snapshot reading phase. Backfill applies only during the snapshot query of a single chunk and does not cover the entire full-read phase. When backfill is skipped, each chunk's snapshot query reads the latest table data at that instant; updates that occur on a chunk after it has been read are not merged during the full-read phase and are read from the Binlog after entering the incremental phase. For example, an update to chunk5 that occurs while chunk5 is being snapshotted is reflected directly in chunk5's snapshot; if chunk5 is updated after the reader has advanced to chunk80, the update is applied later from the Binlog during the incremental phase. Important When enabled, changes that occur during or after a chunk's scan are still delivered from the Binlog in the incremental phase and may be duplicated. Only at-least-once semantics are guaranteed. Enable this only when the downstream sink supports idempotent writes by primary key. Note Supported only in VVR 11.1 and later.
scan.incremental.snapshot.unbounded-chunk-first.enabled	Specifies whether to dispatch unbounded chunks first during the snapshot reading phase.	No	BOOLEAN	false	Valid values: true: Dispatches unbounded chunks first during the snapshot reading phase. false (default): Does not dispatch unbounded chunks first during the snapshot reading phase. This is an experimental feature. Enabling it can reduce the risk of OOM errors on the TaskManager when synchronizing the last chunk during the snapshot phase. Add this parameter before the first startup of the job. Note Supported only in VVR 11.1 and later.
binlog.session.network.timeout	The network read/write timeout for the binary log connection.	No	DURATION	10m	If set to 0s, the default timeout of the MySQL server is used. Note Supported only in VVR 11.5 and later.
scan.rate-limit.records-per-second	Limits the maximum number of records sent by the source per second.	No	LONG	None	This is applicable to scenarios where data reading needs to be limited. This limit is effective in both the full and incremental phases. The `numRecordsOutPerSecond` metric of the source reflects the number of records output by the entire data stream per second. You can adjust this parameter based on this metric. In the full data reading phase, you usually need to reduce the number of rows read in each batch. You can reduce the value of the `scan.incremental.snapshot.chunk.size` parameter. Note Supported only in VVR 11.5 and later.
scan.binlog.tolerate.gtid-holes	Enabling this parameter ignores gaps in the GTID sequence, allowing the job to bypass discontinuous events and continue running.	No	BOOLEAN	false	Before enabling this parameter, you must ensure that the job's start offset has not expired. If the job starts from a cleared or expired GTID offset, the engine will silently skip the missing logs, which will lead to data loss. Note This parameter is supported only in VVR 11.6 and later.
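To show how several of the source table parameters above fit together, the following sketch starts consumption from a timestamp and sets a few optional parameters. The table name `orders_from_ts`, its columns, the server ID range, and the timestamp value are illustrative assumptions.

```sql
CREATE TEMPORARY TABLE orders_from_ts (
  order_id INT,
  customer_name STRING,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql',
  'hostname' = '<yourHostname>',
  'port' = '3306',
  'username' = '<yourUsername>',
  'password' = '<yourPassword>',
  'database-name' = '<yourDatabaseName>',
  'table-name' = '<yourTableName>',
  -- One server ID for each parallel reader.
  'server-id' = '5400-5408',
  -- Skip the historical data and start from a point in time.
  'scan.startup.mode' = 'timestamp',
  'scan.startup.timestamp-millis' = '1700000000000',
  -- Alternative: start from a binary log file and position instead.
  -- 'scan.startup.mode' = 'specific-offset',
  -- 'scan.startup.specific-offset.file' = 'mysql-bin.000003',
  -- 'scan.startup.specific-offset.pos' = '4',
  'server-time-zone' = 'Asia/Shanghai',
  'heartbeat.interval' = '30s',
  -- Custom JDBC connection parameter, as in the jdbc.properties.* row.
  'jdbc.properties.useSSL' = 'false'
);
```

As the table notes, the binary log that covers the chosen timestamp or offset must still be available on the server, and the table schema must not have changed since that position.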

Dimension table-specific parameters

Parameter	Description	Required	Data type	Default value	Notes
url	The MySQL JDBC URL.	No	STRING	None	The URL format is: `jdbc:mysql://<endpoint>:<port>/<database_name>`.
lookup.max-retries	The maximum number of retries after a failed data read.	No	INTEGER	3	Supported only in VVR 6.0.7 and later.
lookup.cache.strategy	The cache policy.	No	STRING	None	The supported cache policies are None, LRU, and ALL. For more information about the values, see Dimension table JOIN statements. Note When you use the LRU cache policy, you must also configure the lookup.cache.max-rows parameter.
lookup.cache.max-rows	The maximum number of cached rows.	No	INTEGER	100000	If you select the LRU cache policy, you must set the cache size. If you select the ALL cache policy, you do not need to set the cache size.
lookup.cache.ttl	The cache time-to-live (TTL).	No	DURATION	10 s	The configuration of lookup.cache.ttl depends on lookup.cache.strategy: If lookup.cache.strategy is set to None, you do not need to configure lookup.cache.ttl. This means the cache does not time out. If lookup.cache.strategy is set to LRU, lookup.cache.ttl is the cache TTL. By default, the cache does not expire. If lookup.cache.strategy is set to ALL, lookup.cache.ttl is the cache loading time. By default, the cache is not reloaded. Use a time format, such as 1min or 10s.
lookup.max-join-rows	The maximum number of results returned when a record from the primary table matches records in the dimension table.	No	INTEGER	1024	None.
lookup.filter-push-down.enabled	Specifies whether to enable filter pushdown for the dimension table.	No	BOOLEAN	false	Valid values: true: Enables filter pushdown for the dimension table. When loading data from the MySQL database table, the dimension table filters data in advance based on the conditions set in the SQL job. false (default): Disables filter pushdown for the dimension table. When loading data from the MySQL database table, the dimension table loads all data. Note Supported only in VVR 8.0.7 and later. Important Dimension table pushdown should only be enabled when a Flink table is used as a dimension table. MySQL source tables do not support enabling filter pushdown. If a Flink table is used as both a source table and a dimension table, and filter pushdown is enabled for the dimension table, you must explicitly set this configuration item to false for the source table using SQL Hints. Otherwise, the job may run abnormally.
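
As a hedged illustration of the cache parameters in the table above, the following sketch declares a dimension table that uses the LRU policy. The table name `mysql_dim_cached`, the column list, and the specific values (50000 cached rows, a 10-minute TTL) are assumptions chosen for illustration, not recommendations. Replace the angle-bracket placeholders with your own connection details.

```sql
-- Sketch only: table name, columns, and option values are illustrative assumptions.
CREATE TEMPORARY TABLE mysql_dim_cached (
  a INT,
  b VARCHAR,
  c VARCHAR
) WITH (
  'connector' = 'mysql',
  'hostname' = '<yourHostname>',
  'port' = '3306',
  'username' = '<yourUsername>',
  'password' = '<yourPassword>',
  'database-name' = '<yourDatabaseName>',
  'table-name' = '<yourTableName>',
  -- The LRU policy requires lookup.cache.max-rows to be set as well.
  'lookup.cache.strategy' = 'LRU',
  'lookup.cache.max-rows' = '50000',
  -- With LRU, the TTL is how long a cached row stays valid.
  'lookup.cache.ttl' = '10min',
  -- Retries after a failed read (supported in VVR 6.0.7 and later).
  'lookup.max-retries' = '3'
);
```

A smaller `lookup.cache.max-rows` lowers memory usage at the cost of more lookups against MySQL, so the right value depends on the size of the dimension table and how often its rows change.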

Sink table-specific parameters

Parameter	Description	Required	Data type	Default value	Notes
url	The MySQL JDBC URL.	No	STRING	None	The URL format is: `jdbc:mysql://<endpoint>:<port>/<database_name>`.
sink.max-retries	The maximum number of retries after a failed data write.	No	INTEGER	3	None.
sink.buffer-flush.batch-size	The number of rows in a single batch write.	No	INTEGER	4096	None.
sink.buffer-flush.max-rows	The number of data rows cached in memory.	No	INTEGER	10000	This parameter takes effect only after a primary key is specified.
sink.buffer-flush.interval	The interval for flushing the cache. If the data in the cache does not meet the output conditions after the specified waiting time, the system automatically outputs all data in the cache.	No	DURATION	1s	None.
sink.ignore-delete	Specifies whether to ignore DELETE operations.	No	BOOLEAN	false	When the stream generated by Flink SQL includes DELETE or UPDATE_BEFORE records and multiple output tasks update different fields of the same table at the same time, data inconsistency may occur. For example, if a record is deleted and another task then updates only some of its fields, the fields that were not updated become null or take their default values, which causes data errors. Set sink.ignore-delete to true to ignore upstream DELETE and UPDATE_BEFORE operations and avoid such issues. Note UPDATE_BEFORE is part of Flink's retraction mechanism and is used to retract the old value in an update operation. When sink.ignore-delete is set to true, all DELETE and UPDATE_BEFORE records are skipped. Only INSERT and UPDATE_AFTER records are processed.
sink.ignore-null-when-update	When updating data, specifies whether to update the corresponding field to null or skip the update for that field if the incoming data field value is null.	No	BOOLEAN	false	Valid values: true: Does not update the field. This parameter can be set to true only when a primary key is set for the Flink table. When set to true: For VVR 8.0.6 and earlier, the sink table does not support batch writing. For VVR 8.0.7 and later, the sink table supports batch writing. Batch writing can significantly improve write efficiency and overall throughput, but it introduces data latency and the risk of OOM errors. Therefore, you must make a trade-off based on your business scenario. false: Updates the field to null. Note This parameter is supported only in VVR 8.0.5 and later.
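
The sink parameters above can be combined as in the following sketch. It is an illustration under stated assumptions rather than a tuned configuration: the table name `mysql_sink_tuned`, the choice of `name` as the primary key, and all numeric values are hypothetical. A primary key is declared because, per the table above, `sink.buffer-flush.max-rows` and `sink.ignore-null-when-update` take effect only when one is set.

```sql
-- Sketch only: table name, primary key, and option values are illustrative assumptions.
CREATE TEMPORARY TABLE mysql_sink_tuned (
  `name` VARCHAR,
  `age` INT,
  PRIMARY KEY (`name`) NOT ENFORCED
) WITH (
  'connector' = 'mysql',
  'hostname' = '<yourHostname>',
  'port' = '3306',
  'username' = '<yourUsername>',
  'password' = '<yourPassword>',
  'database-name' = '<yourDatabaseName>',
  'table-name' = '<yourTableName>',
  -- Rows per batch write and rows buffered in memory.
  'sink.buffer-flush.batch-size' = '2048',
  'sink.buffer-flush.max-rows' = '10000',
  -- Flush whatever is buffered at least every 2 seconds.
  'sink.buffer-flush.interval' = '2s',
  'sink.max-retries' = '5',
  -- Skip upstream DELETE and UPDATE_BEFORE records.
  'sink.ignore-delete' = 'true',
  -- Leave a column unchanged when the incoming value is null (VVR 8.0.5 and later).
  'sink.ignore-null-when-update' = 'true'
);
```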

Type mapping

CDC source tables

MySQL CDC field type	Flink field type
TINYINT	TINYINT
SMALLINT	SMALLINT
TINYINT UNSIGNED	SMALLINT
TINYINT UNSIGNED ZEROFILL	SMALLINT
INT	INT
MEDIUMINT	INT
SMALLINT UNSIGNED	INT
SMALLINT UNSIGNED ZEROFILL	INT
BIGINT	BIGINT
INT UNSIGNED	BIGINT
INT UNSIGNED ZEROFILL	BIGINT
MEDIUMINT UNSIGNED	BIGINT
MEDIUMINT UNSIGNED ZEROFILL	BIGINT
BIGINT UNSIGNED	DECIMAL(20, 0)
BIGINT UNSIGNED ZEROFILL	DECIMAL(20, 0)
SERIAL	DECIMAL(20, 0)
FLOAT [UNSIGNED] [ZEROFILL]	FLOAT
DOUBLE [UNSIGNED] [ZEROFILL]	DOUBLE
DOUBLE PRECISION [UNSIGNED] [ZEROFILL]	DOUBLE
REAL [UNSIGNED] [ZEROFILL]	DOUBLE
NUMERIC(p, s) [UNSIGNED] [ZEROFILL]	DECIMAL(p, s)
DECIMAL(p, s) [UNSIGNED] [ZEROFILL]	DECIMAL(p, s)
BOOLEAN	BOOLEAN
TINYINT(1)	BOOLEAN
DATE	DATE
TIME [(p)]	TIME [(p)] [WITHOUT TIME ZONE]
DATETIME [(p)]	TIMESTAMP [(p)] [WITHOUT TIME ZONE]
TIMESTAMP [(p)]	TIMESTAMP [(p)]
TIMESTAMP [(p)]	TIMESTAMP [(p)] WITH LOCAL TIME ZONE
CHAR(n)	STRING
VARCHAR(n)	STRING
TEXT	STRING
BINARY	BYTES
VARBINARY	BYTES
BLOB	BYTES

Important

Do not use the TINYINT(1) type in MySQL to store values other than 0 and 1. When property-version=0, the MySQL CDC source table maps TINYINT(1) to the BOOLEAN type in Flink by default. This can cause data inaccuracies. To use the TINYINT(1) type to store values other than 0 and 1, see the configuration parameter catalog.table.treat-tinyint1-as-boolean.
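
To make a few rows of the CDC source mapping concrete, the sketch below declares Flink columns for a hypothetical MySQL table. The table name `mysqlcdc_typed_source`, the column names, and the MySQL column types in the comments are assumptions for illustration. Each Flink type follows a row of the mapping table above.

```sql
-- Sketch only: the MySQL column types in the comments are assumed, not taken from a real schema.
CREATE TEMPORARY TABLE mysqlcdc_typed_source (
  id DECIMAL(20, 0),        -- MySQL: BIGINT UNSIGNED
  is_paid BOOLEAN,          -- MySQL: TINYINT(1), storing only 0 and 1
  amount DECIMAL(10, 2),    -- MySQL: DECIMAL(10, 2)
  order_day DATE,           -- MySQL: DATE
  created_at TIMESTAMP(3),  -- MySQL: DATETIME(3)
  region_code STRING,       -- MySQL: CHAR(2)
  payload BYTES,            -- MySQL: BINARY
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql',
  'hostname' = '<yourHostname>',
  'port' = '3306',
  'username' = '<yourUsername>',
  'password' = '<yourPassword>',
  'database-name' = '<yourDatabaseName>',
  'table-name' = '<yourTableName>'
);
```

If the MySQL `TINYINT(1)` column may hold values other than 0 and 1, declaring it as BOOLEAN loses information, which is the situation the Important note above warns about.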

Dimension tables and sink tables

MySQL field type	Flink field type
TINYINT	TINYINT
SMALLINT	SMALLINT
TINYINT UNSIGNED	SMALLINT
INT	INT
MEDIUMINT	INT
SMALLINT UNSIGNED	INT
BIGINT	BIGINT
INT UNSIGNED	BIGINT
BIGINT UNSIGNED	DECIMAL(20, 0)
FLOAT	FLOAT
DOUBLE	DOUBLE
DOUBLE PRECISION	DOUBLE
NUMERIC(p, s)	DECIMAL(p, s) Note p must be <= 38.
DECIMAL(p, s)	DECIMAL(p, s) Note p must be <= 38.
BOOLEAN	BOOLEAN
TINYINT(1)	BOOLEAN
DATE	DATE
TIME [(p)]	TIME [(p)] [WITHOUT TIME ZONE]
DATETIME [(p)]	TIMESTAMP [(p)] [WITHOUT TIME ZONE]
TIMESTAMP [(p)]	TIMESTAMP [(p)] [WITHOUT TIME ZONE]
CHAR(n)	CHAR(n)
VARCHAR(n)	VARCHAR(n)
BIT(n)	BINARY(⌈n/8⌉)
BINARY(n)	BINARY(n)
VARBINARY(N)	VARBINARY(N)
TINYTEXT	STRING
TEXT	STRING
MEDIUMTEXT	STRING
LONGTEXT	STRING
TINYBLOB	BYTES Important Flink only supports MySQL BLOB type records that are less than or equal to 2,147,483,647 (2^31 - 1) bytes.
BLOB	BYTES
MEDIUMBLOB	BYTES
LONGBLOB	BYTES

Data ingestion

You can use the MySQL connector as a data source in a data ingestion YAML job.

Syntax

source:
  type: mysql
  name: MySQL Source
  hostname: localhost
  port: 3306
  username: <username>
  password: <password>
  tables: adb.\.*, bdb.user_table_[0-9]+, [app|web].order_\.*
  server-id: 5401-5404

sink:
  type: xxx

Configuration items

Parameter	Description	Required	Data type	Default value	Notes
type	The data source type.	Yes	STRING	None	The value must be mysql.
name	The data source name.	No	STRING	None	None.
hostname	The IP address or hostname of the MySQL database.	Yes	STRING	None	We recommend that you specify a VPC address. Note If the MySQL database and Realtime Compute for Apache Flink are not in the same VPC, you must establish a cross-VPC network connection or use a public endpoint to access the database. For more information, see Manage and operate workspaces and How can a fully managed Flink cluster access the Internet?
username	The username for the MySQL database service.	Yes	STRING	None	None.
password	The password for the MySQL database service.	Yes	STRING	None	None.
tables	The MySQL data tables to be synchronized.	Yes	STRING	None	This parameter supports regular expressions to read data from multiple tables. You can use commas to separate multiple regular expressions. Note Do not use the start-of-string `^` and end-of-string `$` anchors in the regular expression. In VVR 11.2, the regular expression is split at the period to obtain the database part, and these anchors make the resulting database regular expression unusable. For example, you must change `^db.user_[0-9]+$` to `db.user_[0-9]+`. The period separates the database name from the table name. To use a period to match any character, you must escape it with a backslash. For example: `db0.\.*, db1.user_table_[0-9]+, db[1-2].[app|web]order_\.*`.
tables.exclude	The tables to be excluded from synchronization.	No	STRING	None	This parameter supports regular expressions to exclude multiple tables. You can use commas to separate multiple regular expressions. Note The period separates the database name from the table name. To use a period to match any character, you must escape it with a backslash. For example: `db0.\.*, db1.user_table_[0-9]+, db[1-2].[app|web]order_\.*`.
port	The port number of the MySQL database service.	No	INTEGER	3306	None.
schema-change.enabled	Specifies whether to send schema change events.	No	BOOLEAN	true	None.
server-id	A numeric ID or range for the database client used for synchronization.	No	STRING	A random value between 5400 and 6400 is generated.	This ID must be globally unique within the MySQL cluster. Set a different ID for each job that connects to the same database. This parameter also supports an ID range format, such as 5400-5408. When incremental reading is enabled, concurrent reading is supported. In this case, set an ID range so that each concurrent reader uses a different ID.
jdbc.properties.*	Custom connection parameters in the JDBC URL.	No	STRING	None	You can pass custom connection parameters. For example, to not use the SSL protocol, you can configure 'jdbc.properties.useSSL' = 'false'. For more information about the supported connection parameters, see MySQL Configuration Properties.
debezium.*	Custom parameters for Debezium to read binary logs.	No	STRING	None	You can pass custom Debezium parameters. For example, use 'debezium.event.deserialization.failure.handling.mode'='ignore' to specify the handling logic for parsing errors. Warning Do not modify Debezium parameters arbitrarily. This may cause the connector to read data incorrectly. For example, the debezium.binlog.buffer.size parameter is not allowed to be configured.
scan.incremental.snapshot.chunk.size	The size of each chunk in number of rows.	No	INTEGER	8096	MySQL tables are split into multiple chunks for reading. The data of a chunk is cached in memory until the chunk is fully read. The fewer rows each chunk contains, the more chunks the table is split into. Smaller chunks make fault recovery finer-grained, but the larger number of chunks may lead to OOM errors and lower overall throughput. Therefore, you need to make a trade-off and set a reasonable chunk size.
scan.snapshot.fetch.size	The maximum number of records to pull at a time when reading the full data of a table.	No	INTEGER	1024	None.
scan.startup.mode	The startup mode for data consumption.	No	STRING	initial	Valid values: initial (default): On the first startup, the connector scans the full historical data and then reads the latest binary log data. latest-offset: On the first startup, the connector does not scan the historical data. It starts reading from the end of the binary log, which means it only reads the latest changes made after the connector starts. earliest-offset: The connector does not scan the historical data. It starts reading from the earliest available binary log. specific-offset: The connector does not scan the historical data. It starts from a specific binary log offset. You can specify the offset by configuring both scan.startup.specific-offset.file and scan.startup.specific-offset.pos, or by configuring only scan.startup.specific-offset.gtid-set to start from a specific GTID set. timestamp: The connector does not scan the historical data. It starts reading the binary log from a specified timestamp. The timestamp is specified by scan.startup.timestamp-millis in milliseconds. Important For the earliest-offset, specific-offset, and timestamp startup modes, if the table schema at the startup time is different from the schema at the specified start offset time, the job will report an error due to the schema mismatch. In other words, when using these three startup modes, you must ensure that the schema of the corresponding table does not change between the specified binary log consumption position and the job startup time.
scan.startup.specific-offset.file	The binary log filename for the start offset when using the specific-offset startup mode.	No	STRING	None	When you use this parameter, you must set scan.startup.mode to specific-offset. Example filename format: `mysql-bin.000003`.
scan.startup.specific-offset.pos	The offset within the specified binary log file for the start offset when using the specific-offset startup mode.	No	INTEGER	None	When you use this parameter, you must set scan.startup.mode to specific-offset.
scan.startup.specific-offset.gtid-set	The GTID set for the start offset when using the specific-offset startup mode.	No	STRING	None	When you use this parameter, you must set scan.startup.mode to specific-offset. Example GTID set format: `24DA167-0C0C-11E8-8442-00059A3C7B00:1-19`.
scan.startup.timestamp-millis	The timestamp in milliseconds for the start offset when using the timestamp startup mode.	No	LONG	None	When you use this parameter, you must set scan.startup.mode to timestamp. The timestamp unit is milliseconds. Important When you specify a time, MySQL CDC attempts to read the initial event of each binary log file to determine its timestamp. It then locates the binary log file corresponding to the specified time. Make sure that the binary log file corresponding to the specified timestamp has not been cleared from the database and can be read.
server-time-zone	The session time zone used by the database.	No	STRING	If you do not specify this parameter, the system uses the environment time zone of the Flink job runtime as the database server time zone. This is the time zone of the zone you selected.	Example: Asia/Shanghai. This parameter controls how the TIMESTAMP type in MySQL is converted to the STRING type. For more information, see Debezium temporal values.
scan.startup.specific-offset.skip-events	The number of binary log events to skip when reading from a specified offset.	No	INTEGER	None	When you use this parameter, you must set scan.startup.mode to specific-offset.
scan.startup.specific-offset.skip-rows	The number of row changes to skip when reading from a specified offset. A single binary log event may correspond to multiple row changes.	No	INTEGER	None	When you use this parameter, you must set scan.startup.mode to specific-offset.
connect.timeout	The maximum time to wait for a connection to the MySQL database server before the attempt times out and is retried.	No	DURATION	30 s	None.
connect.max-retries	The maximum number of retries after a failed connection to the MySQL database service.	No	INTEGER	3	None.
connection.pool.size	The size of the database connection pool.	No	INTEGER	20	The database connection pool is used to reuse connections, which can reduce the number of database connections.
heartbeat.interval	The interval at which the source advances the binary log offset using heartbeat events.	No	DURATION	30s	Heartbeat events are used to advance the binary log offset in the source. This is very useful for tables in MySQL that are updated infrequently. For such tables, the binary log offset cannot advance automatically. Heartbeat events can push the binary log offset forward, which prevents issues caused by an expired binary log offset. An expired binary log offset can cause the job to fail and be unrecoverable, requiring a stateless restart.
rds.region-id	The region ID of the Alibaba Cloud ApsaraDB RDS for MySQL instance.	Required when using the feature to read archived logs from OSS.	STRING	None	For more information about region IDs, see Regions and zones. Important Because the GTID string for MySQL CDC is randomly generated and not monotonically increasing like binary log file offsets, locating a GTID in a file requires downloading and parsing all archived logs from OSS. This process is very resource-intensive and time-consuming, making features that rely on GTID offsets infeasible. Therefore, the OSS archived log feature only supports starting from a specified timestamp or a specified binary log file offset. It does not support starting from a specified GTID, nor does it support scenarios with primary/secondary switchovers in the archived logs, because MySQL primary/secondary switchovers rely on GTIDs. Evaluate this feature carefully before use.
rds.access-key-id	The AccessKey ID of the Alibaba Cloud ApsaraDB RDS for MySQL account.	Required when using the feature to read archived logs from OSS.	STRING	None	For more information, see How do I view the AccessKey ID and AccessKey secret? Important To prevent your AccessKey information from being leaked, use the secret management feature to specify the AccessKey ID. For more information, see Manage variables.
rds.access-key-secret	The AccessKey secret of the Alibaba Cloud ApsaraDB RDS for MySQL account.	Required when using the feature to read archived logs from OSS.	STRING	None	For more information, see How do I view the AccessKey ID and AccessKey secret? Important To prevent your AccessKey information from being leaked, use the secret management feature to specify the AccessKey secret. For more information, see Manage variables.
rds.db-instance-id	The ID of the Alibaba Cloud ApsaraDB RDS for MySQL instance.	Required when using the feature to read archived logs from OSS.	STRING	None	None.
rds.main-db-id	The primary database number of the Alibaba Cloud ApsaraDB RDS for MySQL instance.	No	STRING	None	For more information about how to obtain the primary database number, see ApsaraDB RDS for MySQL log backup. Note If this parameter is not specified, VVR 11.7 and later automatically query the primary database number based on the ApsaraDB RDS for MySQL connection information.
rds.download.timeout	The timeout period for downloading a single archived log from OSS.	No	DURATION	60s	None.
rds.endpoint	The service endpoint for obtaining OSS binary log information.	No	STRING	None	For more information about the valid values, see Endpoints.
rds.binlog-directory-prefix	The directory prefix for storing binary log files.	No	STRING	rds-binlog-	None.
rds.use-intranet-link	Specifies whether to use the internal network to download binary log files.	No	BOOLEAN	true	None.
rds.binlog-directories-parent-path	The absolute path of the parent directory for storing binary log files.	No	STRING	None	None.
chunk-meta.group.size	The size of the chunk metadata.	No	INTEGER	1000	If the metadata is larger than this value, it is split into multiple parts for transmission.
chunk-key.even-distribution.factor.lower-bound	The lower bound of the chunk distribution factor for even sharding.	No	DOUBLE	0.05	If the distribution factor is less than this value, uneven sharding is used. Chunk distribution factor = (MAX(chunk-key) - MIN(chunk-key) + 1) / Total number of data rows.
chunk-key.even-distribution.factor.upper-bound	The upper bound of the chunk distribution factor for even sharding.	No	DOUBLE	1000.0	If the distribution factor is greater than this value, uneven sharding is used. Chunk distribution factor = (MAX(chunk-key) - MIN(chunk-key) + 1) / Total number of data rows.
scan.incremental.close-idle-reader.enabled	Specifies whether to close idle readers after the snapshot is complete.	No	BOOLEAN	false	For this configuration to take effect, you must set `execution.checkpointing.checkpoints-after-tasks-finish.enabled` to true.
scan.only.deserialize.captured.tables.changelog.enabled	In the incremental phase, specifies whether to deserialize only the change events of the specified tables.	No	BOOLEAN	The default value is false in VVR 8.x versions. The default value is true in VVR 11.1 and later.	Valid values: true: Deserializes only the change data of the target tables to accelerate binary log reading. false: Deserializes the change data of all tables.
scan.parallel-deserialize-changelog.enabled	In the incremental phase, specifies whether to use multiple threads to parse change events.	No	BOOLEAN	false	Valid values: true: Uses multiple threads in the change event deserialization phase while maintaining the order of binary log events to accelerate reading. false (default): Uses a single thread in the event deserialization phase. Note Supported only in VVR 8.0.11 and later.
scan.parallel-deserialize-changelog.handler.size	The number of event handlers when using multiple threads to parse change events.	No	INTEGER	2	Note Supported only in VVR 8.0.11 and later.
metadata-column.include-list	The metadata columns to be passed to the downstream.	No	STRING	None	The available metadata includes `op_ts`, `es_ts`, `query_log`, `file`, and `pos`. You can use commas to separate multiple metadata columns. Note The MySQL CDC YAML connector does not require or support adding database name, table name, and `op_type` metadata columns. You can directly use `__data_event_type__` in a Transform expression to get the change data type, or use `__schema_name__` and `__table_name__` to get the database name and table name. Important The `file` metadata column represents the binary log file where the data is located. It is "" during the full phase and the binary log filename during the incremental phase. The `pos` metadata column represents the offset of the data in the binary log file. It is "0" during the full phase and the data offset in the binary log file during the incremental phase. These two metadata columns are supported starting from VVR 11.5. The `es_ts` metadata column represents the start time of the corresponding transaction for the changelog on MySQL. It is supported only for MySQL 8.0.x. Do not add this metadata column when using earlier versions of MySQL. The `op_ts` timestamp is accurate to the second, while the `es_ts` timestamp is accurate to the millisecond.
scan.newly-added-table.enabled	When restarting from a checkpoint, specifies whether to synchronize newly added tables that were not matched during the previous startup or to remove tables from the state that are no longer matched.	No	BOOLEAN	false	This takes effect when restarting from a checkpoint or savepoint. Important During the full data reading phase, you cannot save a savepoint, add a new table to or delete a table from the source table, and then restart the job from the savepoint. This will cause the job to fail to read data.
scan.binlog.newly-added-table.enabled	In the incremental phase, specifies whether to send data from newly added tables that are matched.	No	BOOLEAN	false	Cannot be enabled at the same time as `scan.newly-added-table.enabled`.
scan.incremental.snapshot.chunk.key-column	Specifies a column for certain tables to be used as the splitting column for sharding during the snapshot phase.	No	STRING	None	Use a colon `:` to connect the table name and column name to define a rule. The table name can be a regular expression. You can define multiple rules by separating them with a semicolon `;`. For example: `db1.user_table_[0-9]+:col1;db[1-2].[app\|web]_order_\\.*:col2`. Required for tables without a primary key. The selected column must be of a non-null type (NOT NULL). Optional for tables with a primary key. Only one column can be selected from the primary key.
scan.parse.online.schema.changes.enabled	In the incremental phase, specifies whether to attempt to parse RDS lockless change DDL events.	No	BOOLEAN	false	Valid values: true: Parses RDS lockless change DDL events. false (default): Does not parse RDS lockless change DDL events. This is an experimental feature. Before performing an online lockless change, take a snapshot of the Flink job for recovery. Note Supported only in VVR 11.0 and later.
scan.incremental.snapshot.backfill.skip	Specifies whether to skip backfill during the snapshot reading phase.	No	BOOLEAN	false	Valid values: true: Skips backfill during the snapshot reading phase. false (default): Does not skip backfill during the snapshot reading phase. Backfill applies only during the snapshot query of a single chunk and does not cover the entire full-read phase. When backfill is skipped, each chunk's snapshot query reads the latest table data at that instant; updates that occur on a chunk after it has been read are not merged during the full-read phase and are read from the Binlog after entering the incremental phase. For example, an update to chunk5 that occurs while chunk5 is being snapshotted is reflected directly in chunk5's snapshot; if chunk5 is updated after the reader has advanced to chunk80, the update is applied later from the Binlog during the incremental phase. Important When enabled, changes that occur during or after a chunk's scan are still delivered from the Binlog in the incremental phase and may be duplicated. Only at-least-once semantics are guaranteed. Enable this only when the downstream sink supports idempotent writes by primary key. Note Supported only in VVR 11.1 and later.
treat-tinyint1-as-boolean.enabled	Specifies whether to treat the TINYINT(1) type as a Boolean type.	No	BOOLEAN	true	Valid values: true (default): Treats the TINYINT(1) type as a Boolean type. false: Does not treat the TINYINT(1) type as a Boolean type.
treat-timestamp-as-datetime-enabled	Specifies whether to treat the TIMESTAMP type as a DATETIME type.	No	BOOLEAN	false	Valid values: true: Treats the MySQL TIMESTAMP type as a DATETIME type and maps it to the CDC TIMESTAMP type. false (default): Maps the MySQL TIMESTAMP type to the CDC TIMESTAMP_LTZ type. The MySQL TIMESTAMP type stores UTC time and is affected by the time zone. The MySQL DATETIME type stores literal time and is not affected by the time zone. When enabled, this parameter converts MySQL TIMESTAMP type data to DATETIME type based on the server-time-zone.
include-comments.enabled	Specifies whether to synchronize table and column comments.	No	BOOLEAN	false	Valid values: true: Synchronizes table and column comments. false (default): Does not synchronize table and column comments. Enabling this option increases the memory usage of the job.
scan.incremental.snapshot.unbounded-chunk-first.enabled	Specifies whether to dispatch unbounded chunks first during the snapshot reading phase.	No	BOOLEAN	false	Valid values: true: Dispatches unbounded chunks first during the snapshot reading phase. false (default): Does not dispatch unbounded chunks first during the snapshot reading phase. This is an experimental feature. Enabling it can reduce the risk of OOM errors on the TaskManager when synchronizing the last chunk during the snapshot phase. Add this parameter before the first startup of the job. Note Supported only in VVR 11.1 and later.
binlog.session.network.timeout	The network timeout for the binary log connection.	No	DURATION	10m	If set to 0s, the default timeout of the MySQL server is used. Note Supported only in VVR 11.5 and later.
scan.rate-limit.records-per-second	Limits the maximum number of records sent by the source per second.	No	LONG	None	This is applicable to scenarios where data reading needs to be limited. This limit is effective in both the full and incremental phases. The `numRecordsOutPerSecond` metric of the source reflects the number of records output by the entire data stream per second. You can adjust this parameter based on this metric. In the full data reading phase, you usually need to reduce the number of rows read in each batch. You can reduce the value of the `scan.incremental.snapshot.chunk.size` parameter. Note Supported only in VVR 11.5 and later.
include-binlog-meta.enable	Specifies whether to include the original MySQL binary log information, such as GTID and binary log offset, in the message.	No	BOOLEAN	false	This is applicable to original binary log synchronization scenarios, such as replacing an existing Canal synchronization link. Note Supported only in VVR 11.6 and later.
scan.binlog.tolerate.gtid-holes	Enabling this parameter ignores gaps in the GTID sequence, allowing the job to bypass discontinuous events and continue running.	No	BOOLEAN	false	Before enabling this parameter, you must ensure that the job's start offset has not expired. If the job starts from a cleared or expired GTID offset, the engine will silently skip the missing logs, which will lead to data loss. Note This parameter is supported only in VVR 11.6 and later.
scan.emit.create-table-events.in-batch.enabled	Specifies whether to batch send table schemas during the job initialization phase.	No	BOOLEAN	false	This is an experimental feature. Enable this option when a single job synchronizes many tables. Note This parameter is supported only in VVR 11.4 and later.
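
The following sketch shows how several of the configuration items above fit together in one data ingestion job that starts from a timestamp instead of taking a full snapshot. It is an illustration under assumptions: the database name `app_db`, the table patterns, and the timestamp value are hypothetical, and the `${...}` variables follow the convention used in the usage example later in this topic.

```yaml
# Sketch only: database name, table patterns, and timestamp are illustrative assumptions.
source:
  type: mysql
  name: MySQL Source
  hostname: ${mysql.hostname}
  port: 3306
  username: ${mysql.username}
  password: ${mysql.password}
  # Match sharded tables app_db.order_0, app_db.order_1, ... but skip app_db.order_0.
  tables: app_db.order_[0-9]+
  tables.exclude: app_db.order_0
  # One ID per concurrent reader; must be unique among clients of the same MySQL cluster.
  server-id: 5401-5404
  # Skip the snapshot and read the binary log from the given time (milliseconds).
  scan.startup.mode: timestamp
  scan.startup.timestamp-millis: 1700000000000
  # Keep the binary log offset advancing for rarely updated tables.
  heartbeat.interval: 30s
  metadata-column.include-list: op_ts

sink:
  type: values
  name: Values Sink
```

With the timestamp startup mode, the binary log file that covers the chosen time must still exist on the server, and the table schema must not have changed between that time and job startup.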

Reuse an existing catalog

Starting from VVR 11.5, you can directly reference a built-in MySQL catalog created on the Data Management page in a Flink CDC data ingestion job. This reduces the manual effort of writing connection properties.

source:
  type: mysql
  using.built-in-catalog: mysql_rds_catalog

Currently, data ingestion jobs support the automatic reuse of the following MySQL catalog parameters:

hostname
port
username
password
catalog.table.metadata-columns
catalog.table.treat-tinyint1-as-boolean

If you want to override any of these automatically reused parameters, you can explicitly write the corresponding YAML parameter. The explicitly written parameter has a higher priority.
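
As a minimal sketch of such an override, the job below reuses the catalog from the earlier example but sets `port` explicitly. The port value 3307 and the `tables` pattern are assumptions for illustration. `tables` is written out because it is a required parameter that is not in the list of automatically reused parameters.

```yaml
# Sketch only: the port value and table pattern are illustrative assumptions.
source:
  type: mysql
  using.built-in-catalog: mysql_rds_catalog
  # Explicit value takes priority over the port stored in the catalog.
  port: 3307
  tables: app_db.order_[0-9]+
```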

Type mapping

The following table shows the data type mapping for data ingestion.

MySQL CDC field type	CDC field type
TINYINT(n)	TINYINT
SMALLINT	SMALLINT
TINYINT UNSIGNED
TINYINT UNSIGNED ZEROFILL
YEAR
INT	INT
MEDIUMINT
MEDIUMINT UNSIGNED
MEDIUMINT UNSIGNED ZEROFILL
SMALLINT UNSIGNED
SMALLINT UNSIGNED ZEROFILL
BIGINT	BIGINT
INT UNSIGNED
INT UNSIGNED ZEROFILL
BIGINT UNSIGNED	DECIMAL(20, 0)
BIGINT UNSIGNED ZEROFILL
SERIAL
FLOAT [UNSIGNED] [ZEROFILL]	FLOAT
DOUBLE [UNSIGNED] [ZEROFILL]	DOUBLE
DOUBLE PRECISION [UNSIGNED] [ZEROFILL]
REAL [UNSIGNED] [ZEROFILL]
NUMERIC(p, s) [UNSIGNED] [ZEROFILL] and p <= 38	DECIMAL(p, s)
DECIMAL(p, s) [UNSIGNED] [ZEROFILL] and p <= 38
FIXED(p, s) [UNSIGNED] [ZEROFILL] and p <= 38
BOOLEAN	BOOLEAN
BIT(1)
TINYINT(1)
DATE	DATE
TIME [(p)]	TIME [(p)]
DATETIME [(p)]	TIMESTAMP [(p)]
TIMESTAMP [(p)]	The mapping depends on the value of the `treat-timestamp-as-datetime-enabled` parameter: `true`: TIMESTAMP[(p)]; `false`: TIMESTAMP_LTZ[(p)].
CHAR(n)	CHAR(n)
VARCHAR(n)	VARCHAR(n)
BIT(n)	BINARY(⌈n/8⌉)
BINARY(n)	BINARY(n)
VARBINARY(N)	VARBINARY(N)
NUMERIC(p, s) [UNSIGNED] [ZEROFILL] and 38 < p <= 65	STRING Note In MySQL, the decimal data type has a precision of up to 65, but in Flink, the precision is limited to 38. Therefore, if you define a decimal column with a precision greater than 38, you should map it to a string to avoid loss of precision.
DECIMAL(p, s) [UNSIGNED] [ZEROFILL] and 38 < p <= 65
FIXED(p, s) [UNSIGNED] [ZEROFILL] and 38 < p <= 65
TINYTEXT	STRING
TEXT
MEDIUMTEXT
LONGTEXT
ENUM
JSON	STRING Note The JSON data type is converted to a JSON-formatted string in Flink.
GEOMETRY	STRING Note Spatial data types in MySQL are converted to strings with a fixed JSON format. For more information, see MySQL Spatial Data Type Mapping.
POINT
LINESTRING
POLYGON
MULTIPOINT
MULTILINESTRING
MULTIPOLYGON
GEOMETRYCOLLECTION
TINYBLOB	BYTES Note For the BLOB data type in MySQL, only blobs with a length no greater than 2,147,483,647 (2^31 - 1) bytes are supported.
BLOB
MEDIUMBLOB
LONGBLOB

Usage examples

CDC source table

CREATE TEMPORARY TABLE mysqlcdc_source (
  order_id INT,
  order_date TIMESTAMP(0),
  customer_name STRING,
  price DECIMAL(10, 5),
  product_id INT,
  order_status BOOLEAN,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql',
  'hostname' = '<yourHostname>',
  'port' = '3306',
  'username' = '<yourUsername>',
  'password' = '<yourPassword>',
  'database-name' = '<yourDatabaseName>',
  'table-name' = '<yourTableName>'
);

CREATE TEMPORARY TABLE blackhole_sink(
  order_id INT,
  customer_name STRING
) WITH (
  'connector' = 'blackhole'
);

INSERT INTO blackhole_sink
SELECT order_id, customer_name FROM mysqlcdc_source;

Dimension table

CREATE TEMPORARY TABLE datagen_source(
  a INT,
  b BIGINT,
  c STRING,
  `proctime` AS PROCTIME()
) WITH (
  'connector' = 'datagen'
);

CREATE TEMPORARY TABLE mysql_dim (
  a INT,
  b VARCHAR,
  c VARCHAR
) WITH (
  'connector' = 'mysql',
  'hostname' = '<yourHostname>',
  'port' = '3306',
  'username' = '<yourUsername>',
  'password' = '<yourPassword>',
  'database-name' = '<yourDatabaseName>',
  'table-name' = '<yourTableName>'
);

CREATE TEMPORARY TABLE blackhole_sink(
  a INT,
  b STRING
) WITH (
  'connector' = 'blackhole'
);

INSERT INTO blackhole_sink
SELECT T.a, H.b
FROM datagen_source AS T JOIN mysql_dim FOR SYSTEM_TIME AS OF T.`proctime` AS H ON T.a = H.a;

Sink table

CREATE TEMPORARY TABLE datagen_source (
  `name` VARCHAR,
  `age` INT
) WITH (
  'connector' = 'datagen'
);

CREATE TEMPORARY TABLE mysql_sink (
  `name` VARCHAR,
  `age` INT
) WITH (
  'connector' = 'mysql',
  'hostname' = '<yourHostname>',
  'port' = '3306',
  'username' = '<yourUsername>',
  'password' = '<yourPassword>',
  'database-name' = '<yourDatabaseName>',
  'table-name' = '<yourTableName>'
);

INSERT INTO mysql_sink
SELECT * FROM datagen_source;

Data ingestion data source

source:
  type: mysql
  name: MySQL Source
  hostname: ${mysql.hostname}
  port: ${mysql.port}
  username: ${mysql.username}
  password: ${mysql.password}
  tables: ${mysql.source.table}
  server-id: 7601-7604

sink:
  type: values
  name: Values Sink
  print.enabled: true
  sink.print.logger: true

About MySQL CDC source tables

How it works

When a MySQL CDC source table starts, it scans the entire table, splits the table into multiple chunks based on the primary key, and records the current binary log offset. The source table then uses an incremental snapshot algorithm to read the data from each chunk using SELECT statements. The job periodically performs checkpoints to record the completed chunks. If a failover occurs, the job continues to read data from the unfinished chunks. After all chunks are read, the job starts reading incremental change records from the previously recorded binary log offset. The Flink job continues to perform periodic checkpoints to record the binary log offset. If the job fails over, it resumes processing from the last recorded binary log offset, which achieves exactly-once semantics.

For a more detailed explanation of the incremental snapshot algorithm, see MySQL CDC Connector.
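The chunk-based snapshot described above can be illustrated with a deliberately simplified model. The sketch below assumes a numeric primary key and fixed-size key ranges, and it represents a checkpoint as nothing more than the number of chunks already finished. It is not the connector's actual splitting logic, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkSplitSketch {
    /**
     * Splits the inclusive key range [minKey, maxKey] into half-open chunks {start, end}.
     * Overflow near Long.MAX_VALUE is ignored to keep the sketch short.
     */
    static List<long[]> splitIntoChunks(long minKey, long maxKey, long chunkSize) {
        if (chunkSize <= 0 || minKey > maxKey) {
            throw new IllegalArgumentException("chunkSize must be positive and minKey <= maxKey");
        }
        List<long[]> chunks = new ArrayList<>();
        long start = minKey;
        while (start <= maxKey) {
            long end = Math.min(start + chunkSize, maxKey + 1);
            chunks.add(new long[] {start, end});
            start = end;
        }
        return chunks;
    }

    /** Chunks that still have to be read after a failover, given how many were checkpointed as finished. */
    static List<long[]> pendingAfterFailover(List<long[]> chunks, int finishedCount) {
        return new ArrayList<>(chunks.subList(finishedCount, chunks.size()));
    }

    public static void main(String[] args) {
        List<long[]> chunks = splitIntoChunks(1, 10, 4);
        for (long[] chunk : pendingAfterFailover(chunks, 2)) {
            // Each pending chunk corresponds to one range query such as
            // SELECT ... WHERE id >= start AND id < end.
            System.out.println("re-read keys in [" + chunk[0] + ", " + chunk[1] + ")");
        }
    }
}
```

In this model, only the chunks that were not recorded as finished are read again after a failover. Binary log reading starts only after the pending list is empty.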

Metadata

Metadata is useful when data from sharded databases and tables is merged and synchronized. After the merge, you often still need to identify the source database and table of each record. Metadata columns expose the database name and table name of the source table, so you can merge multiple sharded tables into a single destination table without losing this information.

The MySQL CDC Source supports metadata column syntax. You can access the following metadata through metadata columns.

Metadata key	Metadata type	Description
database_name	STRING NOT NULL	The name of the database that contains the row.
table_name	STRING NOT NULL	The name of the table that contains the row.
op_ts	TIMESTAMP_LTZ(3) NOT NULL	The time the row was changed in the database. If the record is from the historical data of the table instead of the binary log, this value is always 0. Note This field is accurate only to the second.
op_type	STRING NOT NULL	The change type of the row: +I (INSERT message), -D (DELETE message), -U (UPDATE_BEFORE message), or +U (UPDATE_AFTER message). Note Supported only in VVR 8.0.7 and later.
query_log	STRING NOT NULL	The MySQL query log record for this row. Note The binlog_rows_query_log_events parameter must be enabled in MySQL to record query logs.

The following code example shows how to merge and synchronize multiple orders tables from multiple sharded databases in a MySQL instance to a holo_orders table in Hologres.

CREATE TEMPORARY TABLE mysql_orders (
  db_name STRING METADATA FROM 'database_name' VIRTUAL,  -- Read the database name.
  table_name STRING METADATA FROM 'table_name' VIRTUAL, -- Read the table name.
  operation_ts TIMESTAMP_LTZ(3) METADATA FROM 'op_ts' VIRTUAL, -- Read the change time.
  op_type STRING METADATA FROM 'op_type' VIRTUAL, -- Read the change type.
  order_id INT,
  order_date TIMESTAMP(0),
  customer_name STRING,
  price DECIMAL(10, 5),
  product_id INT,
  order_status BOOLEAN,
  PRIMARY KEY(order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'flinkuser',
  'password' = 'flinkpw',
  'database-name' = 'mydb_.*', -- Regular expression to match multiple sharded databases.
  'table-name' = 'orders_.*'   -- Regular expression to match multiple sharded tables.
);

INSERT INTO holo_orders SELECT * FROM mysql_orders;

Based on the code above, if the `scan.read-changelog-as-append-only.enabled` parameter is set to true in the WITH clause, the output depends on the primary key of the downstream table:

- If the primary key of the downstream table is `order_id`, the output contains only the last change for each primary key in the upstream table. If the last change for a primary key was a delete operation, the downstream table contains a record with that primary key and an `op_type` of -D.
- If the primary key of the downstream table consists of `order_id`, `operation_ts`, and `op_type`, the output contains the complete change history for each primary key in the upstream table.
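The difference between the two primary key choices can be reproduced with a small simulation. The sketch below assumes that the downstream table behaves as a plain upsert store, in which a row with an existing key overwrites the previous row. The `Change` class, the sample history, and the method names are hypothetical and only model the three columns that matter here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AppendOnlyUpsertSketch {
    /** One row read from the source in append-only mode. */
    static final class Change {
        final int orderId;
        final long operationTs;
        final String opType;

        Change(int orderId, long operationTs, String opType) {
            this.orderId = orderId;
            this.operationTs = operationTs;
            this.opType = opType;
        }
    }

    /** Downstream primary key is order_id: later rows overwrite earlier rows with the same key. */
    static Map<String, Change> upsertByOrderId(List<Change> changes) {
        Map<String, Change> table = new LinkedHashMap<>();
        for (Change c : changes) {
            table.put(String.valueOf(c.orderId), c);
        }
        return table;
    }

    /** Downstream primary key is (order_id, operation_ts, op_type): every change keeps its own row. */
    static Map<String, Change> upsertByFullKey(List<Change> changes) {
        Map<String, Change> table = new LinkedHashMap<>();
        for (Change c : changes) {
            table.put(c.orderId + "|" + c.operationTs + "|" + c.opType, c);
        }
        return table;
    }

    /** Hypothetical history of one order: insert, update (before and after image), delete. */
    static List<Change> sampleHistory() {
        return Arrays.asList(
            new Change(1, 1000L, "+I"),
            new Change(1, 2000L, "-U"),
            new Change(1, 2000L, "+U"),
            new Change(1, 3000L, "-D"));
    }

    public static void main(String[] args) {
        System.out.println("rows keyed by order_id: " + upsertByOrderId(sampleHistory()).size());
        System.out.println("rows keyed by full key: " + upsertByFullKey(sampleHistory()).size());
    }
}
```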

Regular expression support

The MySQL CDC source table supports using regular expressions in the table name or database name to match multiple tables or databases. The following code example shows how to specify multiple tables using a regular expression.
```
CREATE TABLE products (
  db_name STRING METADATA FROM 'database_name' VIRTUAL,
  table_name STRING METADATA FROM 'table_name' VIRTUAL,
  operation_ts TIMESTAMP_LTZ(3) METADATA FROM 'op_ts' VIRTUAL,
  order_id INT,
  order_date TIMESTAMP(0),
  customer_name STRING,
  price DECIMAL(10, 5),
  product_id INT,
  order_status BOOLEAN,
  PRIMARY KEY(order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'root',
  'password' = '123456',
  'database-name' = '(^(test).*|^(tpc).*|txc|.*[p$]|t{2})', -- Regular expression to match multiple databases.
  'table-name' = '(t[5-8]|tt)' -- Regular expression to match multiple tables.
);
```
The regular expressions in the example are explained as follows:
- `^(test).*` is a prefix matching example. This expression can match database names that start with "test", such as "test1" or "test2".
- `.*[p$]` is a suffix matching example. This expression can match database names that end with "p", such as "cdcp" or "edcp".
- `txc` is a specific match. It can match a database name that is exactly "txc".

When MySQL CDC matches a fully qualified table name, it uses the `database-name.table-name` pattern to uniquely identify a table. For example, the pattern `(^(test).*|^(tpc).*|txc|.*[p$]|t{2}).(t[5-8]|tt)` can match tables such as `txc.tt` and `test2.test5` in the database.

Important

In the configuration of an SQL job, the `table-name` and `database-name` parameters do not support using a comma (,) to specify multiple tables or databases.
- To match multiple tables or use multiple regular expressions, connect them with a vertical bar (|) and enclose them in parentheses. For example, to read the `user` and `product` tables, set `table-name` to `(user|product)`.
- If a regular expression contains a comma, you must rewrite it using the vertical bar (|) operator. For example, the regular expression `mytable_\d{1,2}` must be rewritten as the equivalent `(mytable_\d{1}|mytable_\d{2})` to avoid using a comma.
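The patterns above can be tried out locally before they are placed in a WITH clause. The following sketch uses `java.util.regex.Pattern.matches`, which requires the whole name to match. That full-match behavior is an assumption based on the description of `txc` as an exact match, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class TableNameRegexSketch {
    static final String DATABASE_REGEX = "(^(test).*|^(tpc).*|txc|.*[p$]|t{2})";
    static final String TABLE_REGEX = "(t[5-8]|tt)";

    /** Returns true if the whole name matches the regular expression. */
    static boolean matches(String regex, String name) {
        return Pattern.matches(regex, name);
    }

    /** Matches a fully qualified name against the combined database-name.table-name pattern. */
    static boolean matchesQualified(String qualifiedName) {
        return Pattern.matches(DATABASE_REGEX + "." + TABLE_REGEX, qualifiedName);
    }

    public static void main(String[] args) {
        System.out.println(matches(DATABASE_REGEX, "test1"));
        System.out.println(matches(TABLE_REGEX, "t4"));
        System.out.println(matchesQualified("txc.tt"));
        // The comma-free rewrite accepts the same names as the original quantifier.
        System.out.println(matches("(mytable_\\d{1}|mytable_\\d{2})", "mytable_42"));
    }
}
```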
Concurrency control

The MySQL connector supports multithreaded reading of full data, which improves data loading efficiency. Combined with the Autopilot automatic tuning feature in the Realtime Compute for Apache Flink console, the connector can automatically scale in during the incremental phase, after multithreaded reading is complete, to save compute resources.

In the development console of Realtime Compute for Apache Flink, you can set the parallelism (concurrency) of a job in basic mode or expert mode on the Resource Configuration page.
- The parallelism set in basic mode is the global parallelism of the entire job.
  
  For example, when the parallelism is set to 8 in basic mode, the server-id in the SQL WITH clause should be configured as a continuous range that contains at least eight IDs (such as '404-412').
- Expert mode supports setting the parallelism of a specific vertex as needed.

For more information about resource configuration, see Configure deployment information for a job.

Important

In both basic mode and expert mode, the number of server IDs in the range declared for the table must be greater than or equal to the job's parallelism. For example, the server ID range `5404-5412` contains nine unique server IDs, so the job's parallelism can be at most 9. Different jobs that read from the same MySQL instance must not have overlapping server ID ranges. Configure each job explicitly with a different server ID or server ID range.
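The two server ID rules above (enough IDs for the parallelism, and no overlap between jobs) are simple arithmetic on the range bounds. The following is a minimal sketch of such a check. The class and its methods are hypothetical helpers and are not part of the connector.

```java
final class ServerIdRangeCheck {
    /** Parses a range such as "5404-5412", or a single ID such as "5404", into {first, last}. */
    static long[] parse(String serverId) {
        String[] parts = serverId.trim().split("-");
        if (parts.length > 2) {
            throw new IllegalArgumentException("Expected 'id' or 'first-last': " + serverId);
        }
        long first = Long.parseLong(parts[0].trim());
        long last = parts.length == 2 ? Long.parseLong(parts[1].trim()) : first;
        if (first > last) {
            throw new IllegalArgumentException("Range start is greater than range end: " + serverId);
        }
        return new long[] {first, last};
    }

    /** Number of unique server IDs in the range (both bounds are inclusive). */
    static long count(String serverId) {
        long[] range = parse(serverId);
        return range[1] - range[0] + 1;
    }

    /** True if the range has at least one server ID per parallel source task. */
    static boolean supportsParallelism(String serverId, int parallelism) {
        return count(serverId) >= parallelism;
    }

    /** True if two jobs would share at least one server ID. */
    static boolean overlaps(String serverIdA, String serverIdB) {
        long[] a = parse(serverIdA);
        long[] b = parse(serverIdB);
        return a[0] <= b[1] && b[0] <= a[1];
    }

    public static void main(String[] args) {
        System.out.println(count("5404-5412"));
        System.out.println(supportsParallelism("5404-5412", 10));
        System.out.println(overlaps("5404-5412", "5412-5420"));
    }
}
```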
Autopilot automatic scale-in

The full data phase accumulates a large amount of historical data, so historical data is usually read in parallel to improve reading efficiency. In the incremental binary log phase, the amount of binary log data is small and global order must be preserved, so single-threaded reading is usually sufficient. The automatic tuning feature balances performance and resource usage across these two phases.

Automatic tuning monitors the traffic of each task of the MySQL CDC source. After the job enters the binary log phase, if only one task is reading the binary log and the other tasks are idle, automatic tuning automatically reduces the CU count and parallelism of the source. To enable automatic tuning, set the automatic tuning mode to Active on the job O&M page.

Note

The default minimum trigger interval for reducing parallelism is 24 hours. For more information about automatic tuning parameters and details, see Configure automatic tuning.
Startup modes

Use the `scan.startup.mode` configuration item to specify the startup mode of the MySQL CDC source table. The options include the following:
- initial (default): On the first startup, performs a full read of the database table and then switches to incremental mode to read the binary log.
- earliest-offset: Skips the snapshot phase and starts reading from the earliest available binary log offset.
- latest-offset: Skips the snapshot phase and starts reading from the end of the binary log. In this mode, the source table can only read data changes that occur after the job starts.
- specific-offset: Skips the snapshot phase and starts reading from a specified binary log offset. The offset can be specified by the binary log filename and position, or by a GTID set.
- timestamp: Skips the snapshot phase and starts reading binary log events from a specified timestamp.

Usage example:
```
CREATE TABLE mysql_source (...) WITH (
    'connector' = 'mysql-cdc',
    -- The modes are listed together for reference only. Set exactly one 'scan.startup.mode' per table.
    'scan.startup.mode' = 'earliest-offset', -- Start from the earliest offset.
    'scan.startup.mode' = 'latest-offset', -- Start from the latest offset.
    'scan.startup.mode' = 'specific-offset', -- Start from a specific offset.
    'scan.startup.mode' = 'timestamp', -- Start from a specific timestamp.
    'scan.startup.specific-offset.file' = 'mysql-bin.000003', -- Specify the binary log filename in specific-offset mode.
    'scan.startup.specific-offset.pos' = '4', -- Specify the binary log position in specific-offset mode.
    'scan.startup.specific-offset.gtid-set' = '24DA167-0C0C-11E8-8442-00059A3C7B00:1-19', -- Specify the GTID set in specific-offset mode.
    'scan.startup.timestamp-millis' = '1667232000000' -- Specify the startup timestamp in timestamp mode.
    ...
)
```
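The value of `scan.startup.timestamp-millis` is a Unix epoch timestamp in milliseconds. The following sketch shows one way to derive it from a wall-clock time with `java.time`. The class name and the choice of time zone are assumptions made for illustration. The sample value 1667232000000 used above corresponds to 2022-11-01 00:00:00 in UTC+8.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;

final class StartupTimestampSketch {
    /** Converts a local date and time in the given time zone to epoch milliseconds. */
    static long toEpochMillis(LocalDateTime localTime, String zoneId) {
        return localTime.atZone(ZoneId.of(zoneId)).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        long millis = toEpochMillis(LocalDateTime.of(2022, 11, 1, 0, 0), "Asia/Shanghai");
        // Prints the option in the form expected by the WITH clause.
        System.out.println("'scan.startup.timestamp-millis' = '" + millis + "'");
    }
}
```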
Important

- The MySQL source prints the current offset to the log at the INFO level during a checkpoint. The log prefix is `Binlog offset on checkpoint {checkpoint-id}`. This log can help you start a job from a specific checkpoint offset.
- If the table being read has undergone schema changes, starting from `earliest-offset`, `specific-offset`, or `timestamp` may cause an error. This is because the Debezium reader internally saves the latest table schema, and earlier data with a mismatched schema cannot be parsed correctly.

About CDC source tables without primary keys

- Using a table without a primary key requires setting `scan.incremental.snapshot.chunk.key-column`, and only a non-null column can be selected.
- The processing semantics for a CDC source table without a primary key are determined by the behavior of the column specified by `scan.incremental.snapshot.chunk.key-column`:
  - If the specified column is not updated, exactly-once semantics can be guaranteed.
  - If the specified column is updated, only at-least-once semantics can be guaranteed. However, you can still ensure data correctness by specifying a primary key on the downstream table and relying on idempotent writes.

Read backup logs of Alibaba Cloud ApsaraDB RDS for MySQL

The MySQL CDC source table supports reading backup logs of Alibaba Cloud ApsaraDB RDS for MySQL. This is useful in scenarios where the full data phase takes a long time and the local binary log files have been automatically cleared, but the automatically or manually uploaded backup files still exist.

Usage example:
```
CREATE TABLE mysql_source (...) WITH (
    'connector' = 'mysql-cdc',
    'rds.region-id' = 'cn-beijing',
    'rds.access-key-id' = 'xxxxxxxxx', 
    'rds.access-key-secret' = 'xxxxxxxxx', 
    'rds.db-instance-id' = 'rm-xxxxxxxxxxxxxxxxx', 
    'rds.main-db-id' = '12345678',
    'rds.download.timeout' = '60s'
    ...
)
```
Enable CDC Source reuse

In the same job, multiple MySQL CDC source tables start multiple binary log clients. If all source tables are on the same instance, this increases the load on the database. For more information, see MySQL CDC FAQ.

Solution

VVR 8.0.7 and later versions support MySQL CDC source reuse. This feature merges MySQL CDC source tables that can be merged. Merging occurs when the source table configurations are identical, except for the database name, table name, and server-id. The engine automatically merges MySQL CDC sources within the same job.

Procedure
1. Use the SET command in your SQL job:
```
SET 'table.optimizer.source-merge.enabled' = 'true';

-- (For VVR 8.0.8 and 8.0.9) Also set this item:
SET 'sql-gateway.exec-plan.enabled' = 'false';
```
  VVR 11.1 and later versions have reuse enabled by default.
2. Start the job without a state. Because modifying the source reuse configuration changes the job topology, you must start the job without a state. Otherwise, the job may fail to start or you may lose data. If a source is merged, you can see a MergetableSourceScan node in the topology.
Important
- After you enable reuse, do not disable operator chaining. If you set pipeline.operator-chaining to false, it increases the overhead of data serialization and deserialization. The more sources are merged, the greater the overhead.
- In VVR 8.0.7, disabling operator chaining causes serialization issues.

Accelerate binary log reading

When you use the MySQL connector as a source table or a data ingestion data source, it parses binary log files to generate various change messages during the incremental phase. The binary log files record all table changes in binary format. You can accelerate the parsing of binary log files in the following ways.

Enable parsing filter configuration

- Use the `scan.only.deserialize.captured.tables.changelog.enabled` configuration item to parse only the change events of specified tables.

Optimize Debezium parameters
```
debezium.max.queue.size: 162580
debezium.max.batch.size: 40960
debezium.poll.interval.ms: 50
```
- debezium.max.queue.size: The maximum number of records that the blocking queue can hold. When Debezium reads an event stream from the database, it places the events in a blocking queue before writing them downstream. The default value is 8192.
- debezium.max.batch.size: The maximum number of events that the connector processes in each iteration. The default value is 2048.
- debezium.poll.interval.ms: The number of milliseconds the connector should wait before it requests new change events. The default value is 1000 milliseconds, or 1 second.

Usage example:

CREATE TABLE mysql_source (...) WITH (
    'connector' = 'mysql-cdc',
    -- Debezium configuration
    'debezium.max.queue.size' = '162580',
    'debezium.max.batch.size' = '40960',
    'debezium.poll.interval.ms' = '50',
    -- Enable parsing filter
    'scan.only.deserialize.captured.tables.changelog.enabled' = 'true',  -- Parse only the change events of specified tables.
    ...
)

source:
  type: mysql
  name: MySQL Source
  hostname: ${mysql.hostname}
  port: ${mysql.port}
  username: ${mysql.username}
  password: ${mysql.password}
  tables: ${mysql.source.table}
  server-id: 7601-7604
  # Debezium configuration
  debezium.max.queue.size: 162580
  debezium.max.batch.size: 40960
  debezium.poll.interval.ms: 50
  # Enable parsing filter
  scan.only.deserialize.captured.tables.changelog.enabled: true

The binary log consumption capacity of the MySQL CDC Enterprise Edition is 85 MB/s, which is about twice that of the open source community version. When the generation speed of binary log files exceeds 85 MB/s (that is, one 512 MB file every 6 seconds), the latency of the Flink job continues to increase. The processing latency gradually decreases after the binary log file generation speed slows down. If a binary log file contains a large transaction, the processing latency may temporarily increase. The latency decreases after the log for that transaction is read.
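The figures above follow from simple arithmetic: at 85 MB/s, a 512 MB binary log file takes about 6 seconds to consume. The sketch below reproduces that calculation and adds a rough backlog estimate. The linear backlog model, in which unread data grows at the difference between the production and consumption rates, is an assumption for illustration, and the class and method names are hypothetical.

```java
final class BinlogBacklogEstimate {
    /** Consumption capacity stated above, in MB per second. */
    static final double CONSUME_MB_PER_SEC = 85.0;

    /** Seconds needed to consume one binary log file of the given size. */
    static double secondsPerFile(double fileSizeMb) {
        return fileSizeMb / CONSUME_MB_PER_SEC;
    }

    /** Unread binary log data (MB) after the given time, assuming constant rates. */
    static double backlogMb(double produceMbPerSec, double seconds) {
        return Math.max(0.0, (produceMbPerSec - CONSUME_MB_PER_SEC) * seconds);
    }

    public static void main(String[] args) {
        System.out.printf("512 MB file: %.2f s%n", secondsPerFile(512.0));
        System.out.printf("backlog after 60 s at 100 MB/s: %.0f MB%n", backlogMb(100.0, 60.0));
    }
}
```

Under this model, the backlog stays at zero while the production rate is at or below 85 MB/s and grows linearly once the rate exceeds it.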

MySQL CDC DataStream API

Important

When you read and write data through the DataStream API, you need to use the corresponding DataStream connector to connect to Flink. For more information about how to set up the DataStream connector, see How to use the DataStream connector.

You can create a DataStream API program and use MySqlSource. The following code and pom dependency examples are provided:

Java

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
public class MySqlSourceExample {
  public static void main(String[] args) throws Exception {
    MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
        .hostname("yourHostname")
        .port(3306) // replace with the port of your MySQL service
        .databaseList("yourDatabaseName") // set captured database
        .tableList("yourDatabaseName.yourTableName") // set captured table
        .username("yourUsername")
        .password("yourPassword")
        .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
        .build();
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // enable checkpoint
    env.enableCheckpointing(3000);
    env
      .fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL Source")
      // set 4 parallel source tasks
      .setParallelism(4)
      .print().setParallelism(1); // use parallelism 1 for sink to keep message ordering
    env.execute("Print MySQL Snapshot + Binlog");
  }
}

XML

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-core</artifactId>
    <version>${flink.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java</artifactId>
    <version>${flink.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-base</artifactId>
    <version>${flink.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-common</artifactId>
    <version>${flink.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients</artifactId>
    <version>${flink.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-api-java-bridge</artifactId>
    <version>${flink.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.alibaba.ververica</groupId>
    <artifactId>ververica-connector-mysql</artifactId>
    <version>${vvr.version}</version>
</dependency>

When you build MySqlSource, you must specify the following parameters in the code:

Parameter	Description
hostname	The IP address or hostname of the MySQL database.
port	The port number of the MySQL database service.
databaseList	The name of the MySQL database. Note This parameter supports regular expressions to read data from multiple databases. You can use `.*` to match all databases.
username	The username for the MySQL database service.
password	The password for the MySQL database service.
deserializer	The deserializer that converts SourceRecord records to a specified type. Valid values: RowDataDebeziumDeserializeSchema, which converts SourceRecord to RowData, the internal data structure of Flink Table and SQL; JsonDebeziumDeserializationSchema, which converts SourceRecord to a JSON-formatted string.

The pom dependencies must specify the following parameters:

Parameter	Description
${vvr.version}	The engine version of Alibaba Cloud Realtime Compute for Apache Flink, for example: `1.17-vvr-8.0.4-3`. Note Use the version number displayed on Maven, because hotfix versions may be released without notification through other channels.
${flink.version}	The Apache Flink version, for example: `1.17.2`. Important Use the Apache Flink version that corresponds to the engine version of Alibaba Cloud Realtime Compute for Apache Flink to avoid incompatibility issues during job runtime. For more information about the version mapping, see Engine versions.

FAQ

For more information about problems you may encounter when using CDC source tables, see CDC FAQ.

Flink CDC technical principles and Enterprise Edition features

Flink CDC Enterprise Edition features
Flink CDC technology
