Change Data Capture Architecture Using Debezium, Postgres, and Kafka
Hands-on tutorial on how to set up Change Data Capture using Kafka and PostgreSQL.
Change Data Capture (CDC) is a technique used to track row-level changes in database tables in response to create, update, and delete operations. Different databases use different techniques to expose these change data events - for example, logical decoding in PostgreSQL and the binary log (binlog) in MySQL. This is a powerful capability, but it is useful only if there is a way to tap into these event logs and make them available to other services that depend on that information.
Debezium does just that! It is a distributed platform that builds on top of the Change Data Capture features available in different databases. It provides a set of Kafka Connect connectors that tap into row-level changes (using CDC) in database tables and convert them into event streams. These event streams are sent to Apache Kafka, which is a scalable event streaming platform - a perfect fit! Once the change log events are in Kafka, they are available to all downstream applications.
This differs from the "polling" technique adopted by the Kafka Connect JDBC connector, which repeatedly queries tables for new or changed rows.
The diagram (from the debezium.io website) summarises it nicely!
This blog is a guide to getting started with setting up a change data capture-based system on Azure using Debezium, Azure DB for PostgreSQL, and Azure Event Hubs (for Kafka). It uses the Debezium PostgreSQL connector to stream database modifications from PostgreSQL to Kafka topics in Azure Event Hubs.
The related config files are available in the GitHub repo https://github.com/abhirockzz/
Although I have used managed Azure services for demonstration purposes, these instructions should work for any other setup as well, e.g. a local Kafka cluster and PostgreSQL instance.
Azure DB for PostgreSQL is a managed, relational database service based on the community version of open-source PostgreSQL database engine, and is available in two deployment modes.
At the time of writing, it supports multiple PostgreSQL major versions (check the Azure documentation for the current list).
You can set up PostgreSQL on Azure using a variety of options including the Azure Portal, Azure CLI, Azure PowerShell, or an ARM template. Once you've done that, you can easily connect to the database using your favourite programming language, such as Java, .NET, Node.js, Python, Go, etc.
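For example, with the Azure CLI - a minimal sketch where the resource names, location, admin credentials, and SKU are all placeholders:

```bash
# Sketch: create a resource group and an Azure DB for PostgreSQL (Single Server)
# instance. All names, the location, and the SKU below are placeholder values.
az group create --name my-rg --location eastus

az postgres server create \
  --resource-group my-rg \
  --name my-pg-server \
  --location eastus \
  --admin-user pgadmin \
  --admin-password "<a-strong-password>" \
  --sku-name GP_Gen5_2 \
  --version 11
```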
Although the above references are for Single Server deployment mode, please note that Hyperscale (Citus) is another deployment mode you can use for "workloads that are approaching -- or already exceed -- 100 GB of data."
Please keep the following PostgreSQL-related information handy, since you will need it to configure the Debezium connector in the subsequent sections: database hostname (and port), username, and password.
Azure Event Hubs is a fully managed data streaming platform and event ingestion service. It also provides a Kafka endpoint that supports Apache Kafka protocol 1.0 and later and works with existing Kafka client applications and other tools in the Kafka ecosystem, including Kafka Connect (demonstrated in this blog).
You can use the Azure Portal, Azure CLI, PowerShell, or an ARM template to create an Azure Event Hubs namespace and other resources. To ensure that the Kafka functionality is enabled, all you need to do is choose at least the Standard tier (the Basic tier doesn't support Kafka on Event Hubs).
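For example, with the Azure CLI (placeholder names again; the Kafka endpoint comes with the Standard tier and above):

```bash
# Sketch: create an Event Hubs namespace whose Kafka endpoint Debezium will target.
az eventhubs namespace create \
  --resource-group my-rg \
  --name my-eventhubs-ns \
  --location eastus \
  --sku Standard
```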
To run Kafka Connect, I will be using a local Kafka installation just for convenience. Just download Apache Kafka, unzip its contents, and you're good to go!
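For example (the version here is just an illustration - any recent binary release should work):

```bash
# Sketch: download and unpack an Apache Kafka binary release, then remember
# its location for later steps.
curl -O https://archive.apache.org/dist/kafka/2.5.0/kafka_2.12-2.5.0.tgz
tar -xzf kafka_2.12-2.5.0.tgz
export KAFKA_HOME=$PWD/kafka_2.12-2.5.0
```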
To start with, clone this Git repo:
Download the Debezium PostgreSQL source connector JARs - 1.2.0 is the latest version at the time of writing.
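A sketch of fetching the plug-in archive from Maven Central:

```bash
# Sketch: download and extract the Debezium PostgreSQL connector plug-in.
curl -O https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/1.2.0.Final/debezium-connector-postgres-1.2.0.Final-plugin.tar.gz
tar -xzf debezium-connector-postgres-1.2.0.Final-plugin.tar.gz
```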
You should now see a new folder named debezium-connector-postgres. Copy the connector JAR files to your Kafka installation:
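For example, dropping them into Kafka's libs folder (pointing plugin.path in connect.properties at the extracted folder works too):

```bash
# Sketch: make the Debezium JARs visible to Kafka Connect.
cp debezium-connector-postgres/*.jar $KAFKA_HOME/libs/
```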
Before starting the Kafka Connect cluster, edit the connect.properties file to include appropriate values for the Event Hubs endpoint and credentials, including the sasl.jaas.config, producer.sasl.jaas.config, and consumer.sasl.jaas.config attributes (just replace the placeholders).
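Here is a sketch of those entries, assuming the Event Hubs namespace created earlier (the connection string comes from the namespace's Shared access policies; everything below is a placeholder):

```bash
# Sketch: Event Hubs-related entries in connect.properties.
# Replace the namespace and connection string placeholders with your values.
cat >> connect.properties <<'EOF'
bootstrap.servers=my-eventhubs-ns.servicebus.windows.net:9093

security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="<EVENT HUBS CONNECTION STRING>";

producer.security.protocol=SASL_SSL
producer.sasl.mechanism=PLAIN
producer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="<EVENT HUBS CONNECTION STRING>";

consumer.security.protocol=SASL_SSL
consumer.sasl.mechanism=PLAIN
consumer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="<EVENT HUBS CONNECTION STRING>";
EOF
```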
Start the Kafka Connect cluster (I am running it in distributed mode):
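A sketch, assuming Kafka was unpacked to $KAFKA_HOME and connect.properties is in the current directory:

```bash
# Sketch: launch Kafka Connect in distributed mode with the edited properties file.
$KAFKA_HOME/bin/connect-distributed.sh connect.properties
```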
Wait for the Kafka Connect instance to start - you should see its internal topics (the config, offset, and status storage topics defined in connect.properties) show up in Azure Event Hubs.
Before installing the connector, we need to:
- Ensure that the PostgreSQL instance is accessible from your Kafka Connect cluster
- Ensure that the PostgreSQL replication setting is set to "Logical"
- Create a table which you can use to try out the change data capture feature
If you're using Azure DB for PostgreSQL, create a firewall rule using the az postgres server firewall-rule create command to whitelist your Kafka Connect host. In my case, it was a local Kafka Connect cluster, so I simply navigated to the Azure portal (Connection security section of my PostgreSQL instance) and chose Add current client IP address to make sure that my local IP was added to the firewall rule.
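If you prefer the CLI route, it looks roughly like this (placeholder names, plus your Connect host's public IP):

```bash
# Sketch: whitelist the Kafka Connect host's public IP for the PostgreSQL server.
az postgres server firewall-rule create \
  --resource-group my-rg \
  --server-name my-pg-server \
  --name allow-kafka-connect \
  --start-ip-address <CONNECT_HOST_IP> \
  --end-ip-address <CONNECT_HOST_IP>
```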
To change the replication mode for Azure DB for PostgreSQL, you can use the az postgres server configuration command:
Alternatively, use the Replication menu of your PostgreSQL instance in the Azure Portal.
After updating the configuration, you will need to restart the server, which you can do using the CLI (az postgres server restart) or the portal.
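As a sketch, with the same placeholder names as before:

```bash
# Sketch: switch the server's replication support to logical, then restart
# so the change takes effect.
az postgres server configuration set \
  --resource-group my-rg \
  --server-name my-pg-server \
  --name azure.replication_support \
  --value logical

az postgres server restart \
  --resource-group my-rg \
  --name my-pg-server
```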
Once the database is up and running, create the table - I have used the psql CLI in this example, but feel free to use any other tool. For example, to connect to your PostgreSQL database on Azure over SSL (you will be prompted for the password).
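A sketch follows - the connection details are the placeholders from earlier, and the todos schema is just an illustration:

```bash
# Sketch: connect over SSL (psql prompts for the password) and create an
# illustrative todos table. Adjust the columns to taste.
psql "host=my-pg-server.postgres.database.azure.com port=5432 dbname=postgres user=pgadmin@my-pg-server sslmode=require" <<'EOF'
CREATE TABLE todos (
    id SERIAL PRIMARY KEY,
    description VARCHAR(100),
    todo_status VARCHAR(20)
);
EOF
```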
Install Debezium PostgreSQL Source Connector
Create a pg-source-connector.json file with the details for the Azure PostgreSQL instance. Here is an example.
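A sketch of that file, written as a shell heredoc - the connector name, host, credentials, and logical server name are placeholders matching the earlier steps:

```bash
# Sketch: Debezium PostgreSQL source connector config (placeholder values).
cat > pg-source-connector.json <<'EOF'
{
    "name": "todo-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "my-pg-server.postgres.database.azure.com",
        "database.port": "5432",
        "database.user": "pgadmin@my-pg-server",
        "database.password": "<password>",
        "database.dbname": "postgres",
        "database.server.name": "myserver",
        "plugin.name": "wal2json",
        "table.whitelist": "public.todos"
    }
}
EOF
```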
Let's go through the configuration:
For detailed info, check the Debezium documentation.
connector.class: name of the connector class (this is a static value)
database.hostname and database.port: IP address or hostname of your PostgreSQL instance, as well as the port (5432 by default)
database.user and database.password: username and password for your PostgreSQL instance
database.dbname: name of the database to stream changes from
database.server.name: Logical name that identifies and provides a namespace for the particular PostgreSQL database server/cluster being monitored.
table.whitelist: comma-separated list of regex specifying which tables you want to monitor for change data capture
plugin.name: name of the logical decoding plug-in. At the time of writing, Debezium supports the following plug-ins: decoderbufs, wal2json, and pgoutput. I have used wal2json in this example, and it's supported on Azure as well!
Finally, install the connector!
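A sketch using the Kafka Connect REST API (assuming Connect listens on its default port, 8083):

```bash
# Sketch: register the Debezium source connector with Kafka Connect.
curl -X POST -H "Content-Type: application/json" \
  --data @pg-source-connector.json \
  http://localhost:8083/connectors
```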
Kafka Connect will now start monitoring the todos table for create, update, and delete events.
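To generate some change events, insert a couple of rows - the rows below are just examples:

```bash
# Sketch: add sample rows to the todos table (same placeholder connection).
psql "host=my-pg-server.postgres.database.azure.com port=5432 dbname=postgres user=pgadmin@my-pg-server sslmode=require" <<'EOF'
INSERT INTO todos (description, todo_status) VALUES ('setup postgresql on azure', 'complete');
INSERT INTO todos (description, todo_status) VALUES ('read from kafka topic', 'pending');
EOF
```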
The connector should now spring into action and send the CDC events to an Event Hubs topic named <server name in config>.<schema>.<table name>, e.g. myserver.public.todos with the sample config above.
Let's introspect the contents of the topic to make sure everything is working as expected. I am using kafkacat in this example, but you can also create a consumer app using any of the options listed here.
Update the broker list and sasl.password attributes in kafkacat.conf to include the Kafka broker details. In a different terminal, use it to read the CDC payloads.
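A sketch of both the config file and the consume command - the broker and connection string are placeholders, and the -F flag assumes a reasonably recent kafkacat:

```bash
# Sketch: kafkacat.conf for Event Hubs, then consume the CDC topic from the start.
cat > kafkacat.conf <<'EOF'
metadata.broker.list=my-eventhubs-ns.servicebus.windows.net:9093
security.protocol=SASL_SSL
sasl.mechanisms=PLAIN
sasl.username=$ConnectionString
sasl.password=<EVENT HUBS CONNECTION STRING>
EOF

kafkacat -F kafkacat.conf -t myserver.public.todos -C -o beginning
```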
You should see JSON payloads representing the change data events generated in PostgreSQL in response to the rows you had just added to the todos table. Here is a snippet of the payload.
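The snippet below is illustrative - the field values simply mirror the sample rows inserted earlier, and the schema section is omitted:

```json
{
    "payload": {
        "before": null,
        "after": {
            "id": 1,
            "description": "setup postgresql on azure",
            "todo_status": "complete"
        },
        "source": {
            "version": "1.2.0.Final",
            "connector": "postgresql",
            "name": "myserver",
            "db": "postgres",
            "schema": "public",
            "table": "todos"
        },
        "op": "c"
    }
}
```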
The event consists of the payload along with its schema (omitted for brevity). In the payload section, notice how the create operation ("op": "c") is represented: "before": null means that this was a newly inserted row, after provides the values for each column in the row, source provides the PostgreSQL instance metadata from where this event was picked up, and so on.
You can try the same with update or delete operations as well and introspect the CDC events.
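For example (same placeholder connection details as before):

```bash
# Sketch: an update ("op": "u") and a delete ("op": "d") to compare against
# the create event seen earlier.
psql "host=my-pg-server.postgres.database.azure.com port=5432 dbname=postgres user=pgadmin@my-pg-server sslmode=require" <<'EOF'
UPDATE todos SET todo_status = 'complete' WHERE id = 2;
DELETE FROM todos WHERE id = 1;
EOF
```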
(Optional) Install File Sink Connector
As a bonus, you can quickly test this with a File Sink connector as well. It is available out of the box in the Kafka distribution - all you need to do is install the connector. Just replace the file attribute in the sink connector config with a path on your machine.
To create the connector, POST its config to the Kafka Connect REST API.
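A sketch - the file name file-sink-connector.json, the connector name, and the output path are all placeholders:

```bash
# Sketch: FileStreamSink connector config plus registration via the REST API.
cat > file-sink-connector.json <<'EOF'
{
    "name": "cdc-file-sink",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
        "tasks.max": "1",
        "topics": "myserver.public.todos",
        "file": "/tmp/todos-cdc.txt"
    }
}
EOF

curl -X POST -H "Content-Type: application/json" \
  --data @file-sink-connector.json \
  http://localhost:8083/connectors
```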
Play around with the database records and monitor the records in the configured output sink file.
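For example, watching the placeholder path used above:

```bash
# Sketch: follow the sink file as change events are appended.
tail -f /tmp/todos-cdc.txt
```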
If you've reached this far, thanks for reading (this rather lengthy tutorial)!
Change Data Capture is a powerful technique that can help "unlock the database" by providing near real-time access to its changes. This was a "getting started" guide meant to help you get up and running quickly so you can experiment and iterate further. Hope you found it useful!