Change Data Capture (CDC) With Embedded Debezium and Spring Boot
Join the DZone community and get the full member experience.
Join For FreeWhile working with data or replicating data sources, you probably have heard the term Change Data Capture (CDC). As the name suggests, “CDC” is a design pattern that continuously identifies and captures incremental changes to data. This pattern is used for real-time data replication across live databases to analytical data sources or read replicas. It can also be used to trigger events based on data changes, such as the OutBox pattern.
Most modern databases support CDC through transaction logs. A transaction log is a sequential record of all changes made to the database while the actual data is contained in a separate file.
In this blog, I wanted to focus on using a framework commonly used for CDC, and embedding it with Spring Boot.
You may also like: Kafka Connectors Without Kafka.
What Is Debezium?
Debezium is a distributed platform built for CDC. It uses database transaction logs and creates event streams on row-level changes. Applications listening to these events can perform needed actions based on incremental data changes.
Debezium provides a library of connectors, supporting a variety of databases available today. These connectors can monitor and record row-level changes in the schemas of a database. They then publish the changes on to a streaming service like Kafka.
Normally, one or more connectors are deployed into a Kafka Connect cluster and are configured to monitor databases and publish data-change events to Kafka. A distributed Kafka Connect cluster provides the fault tolerance and scalability needed, ensuring that all the configured connectors are always running.
What Is Embedded Debezium?
Applications that don’t need the level of fault tolerance and reliability Kafka Connect offers or want to minimize the cost of using them to run the entire platform, can run Debezium connectors within the application. This is done by embedding the Debezium engine and configuring the connector to run within the application. On data change events, the connectors send them directly to the application.
Running Debezium With Spring Boot
Keeping the example simple, let’s have a Spring Boot application, "Student CDC Relay," running embedded Debezium and tailing the transaction logs of a Postgres database that houses the “Student” table. The Debezium connector configured within the Spring Boot application invokes a method within the application when a database operation like Insert/Update/Delete are made on the “Student” table. The method acts on these events and syncs the data within the Student index on ElasticSearch.
Installing Tools
All the needed tools can be installed running the docker-compose file below. This starts the Postgres database on port 5432 and Elastic Search on port 9200(HTTP) and 9300(Transport).
version: "3.5"
services:
# Install postgres and setup the student database.
postgres:
container_name: postgres
image: debezium/postgres
ports:
- 5432:5432
environment:
- POSTGRES_DB=studentdb
- POSTGRES_USER=user
- POSTGRES_PASSWORD=password
# Install Elasticsearch.
elasticsearch:
container_name: elasticsearch
image: docker.elastic.co/elasticsearch/elasticsearch:6.8.0
environment:
- discovery.type=single-node
ports:
- 9200:9200
- 9300:9300
We use the image debezium/postgres
because it comes prebuilt with the logical decoding feature. This is a mechanism that allows the extraction of changes that were committed to the transaction log, making CDC possible. The documentation for installing the plugin to Postgres can be found here.
Understanding the Code
The first step is to define the Maven dependencies in the pom.xml
for debezium-embedded
and debezium-connector
. The sample reads the changes from Postgres, so we use the Postgres connector.
<dependency>
<groupId>io.debezium</groupId>
<artifactId>debezium-embedded</artifactId>
<version>${debezium.version}</version>
</dependency>
<dependency>
<groupId>io.debezium</groupId>
<artifactId>debezium-connector-postgres</artifactId>
<version>${debezium.version}</version>
</dependency>
Then, we configure the connector, which listens to changes on the Student table. We use the class PostgresConnector
for the connector.class
setting, which is provided by Debezium. This is the name of the Java class for the connector, which tails the source database.
The connector also takes an important setting, offset.storage
, which helps the application keep track of how much it has processed from the transaction log. Should the application fail while processing, it can resume reading the changes from the point it failed after restart.
There are multiple ways of storing offsets, but in this example we use the class FileOffsetBackingStore
to store offsets in a local file defined by offset.storage.file.filename
. The connector records the offsets within the file, and, for every change it reads, the Debezium engine flushes the offsets to the file periodically based on setting offset.flush.interval.ms
.
The other parameters to the connector are the Postgres database properties which house the Student table.
@Bean
public io.debezium.config.Configuration studentConnector() {
return io.debezium.config.Configuration.create()
.with("connector.class", "io.debezium.connector.postgresql.PostgresConnector")
.with("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore")
.with("offset.storage.file.filename", "/path/cdc/offset/student-offset.dat")
.with("offset.flush.interval.ms", 60000)
.with("name", "student-postgres-connector")
.with("database.server.name", studentDBHost+"-"+studentDBName)
.with("database.hostname", studentDBHost)
.with("database.port", studentDBPort)
.with("database.user", studentDBUserName)
.with("database.password", studentDBPassword)
.with("database.dbname", studentDBName)
.with("table.whitelist", STUDENT_TABLE_NAME).build();
}
The final change to setup embedded Debezium is to start it when the application starts up. For this, we use the class EmbeddedEngine
, which acts as a wrapper for the connector and manages the connectors lifecycle. The engine is created using the connector configuration and a function that it will call for every data change event — in our example the method handleEvent()
.
private CDCListener(Configuration studentConnector, StudentService studentService) {
this.engine = EmbeddedEngine
.create()
.using(studentConnector)
.notifying(this::handleEvent).build();
this.studentService = studentService;
}
On the handleEvent()
we parse every event, identify which operation took place, and invoke the StudentService
to perform the Create/Update/Delete operations on Elastic Search using Spring Data JPA for Elasticsearch.
Now that we have setup the EmbeddedEngine
, we can start it asynchronously using the service Executor
.
private final Executor executor = Executors.newSingleThreadExecutor();
@PostConstruct
private void start() {
this.executor.execute(engine);
}
@PreDestroy
private void stop() {
if (this.engine != null) {
this.engine.stop();
}
}
Seeing the Code in Action
We then start all the required tools by running the docker-compose file, using the command docker-compose up -d
and starting the ‘Student CDC Relay’ using the command mvn spring-boot:run
. We can set up the Student table by running the following script:
CREATE TABLE public.student
(
id integer NOT NULL,
address character varying(255),
email character varying(255),
name character varying(255),
CONSTRAINT student_pkey PRIMARY KEY (id)
);
To see the code in action, we make data changes on the table we just created.
Inserting a Record to the Student Table
We can run the following SQL statement to insert a record in to the Student table in our Postgres database.
INSERT INTO STUDENT(ID, NAME, ADDRESS, EMAIL) VALUES('1','Jack','Dallas, TX','jack@gmail.com');
We can verify that a record was created on ElasticSearch.
$ curl -X GET http://localhost:9200/student/student/1?pretty=true
{
"_index" : "student",
"_type" : "student",
"_id" : "1",
"_version" : 31,
"_seq_no" : 30,
"_primary_term" : 1,
"found" : true,
"_source" : {
"id" : 1,
"name" : "Jack",
"address" : "Dallas, TX",
"email" : "jack@gmail.com"
}
}
Updating a Record in the Student Table
We can run the following SQL statement to update a record on the Student table in our Postgres database.
UPDATE STUDENT SET EMAIL='jill@gmail.com', NAME='Jill' WHERE ID = 1;
We can verify that the data has been changed to ‘Jill’ on ElasticSearch.
$ curl -X GET http://localhost:9200/student/student/1?pretty=true
{
"_index" : "student",
"_type" : "student",
"_id" : "1",
"_version" : 32,
"_seq_no" : 31,
"_primary_term" : 1,
"found" : true,
"_source" : {
"id" : 1,
"name" : "Jill",
"address" : "Dallas, TX",
"email" : "jill@gmail.com"
}
}
Deleting a Record in the Student Table
We can run the following SQL statement to delete a record from the Student table in our Postgres database.
DELETE FROM STUDENT WHERE ID = 1;
We can verify that the data has been deleted with ElasticSearch.
$ curl -X GET http://localhost:9200/student/student/1?pretty=true
{
"_index" : "student",
"_type" : "student",
"_id" : "1",
"_version" : 33,
"_seq_no" : 32,
"_primary_term" : 1,
"found" : true,
"_source" : {
"id" : 1,
"name" : null,
"address" : null,
"email" : null
}
}
Final Thoughts
This approach is indeed far simpler with few moving parts, but it is more limited in terms of scaling and far less tolerant of failures.
The source record will be handled exactly once when the CDC-Relay application is running fine. The underlying applications do need to be tolerant of receiving duplicate events following a restart of the CDC-Relay application.
We can test the limitations around scaling by starting another instance of the ‘Student CDC Relay’ [on another port]. We see the below exception:
ERROR 59453 --- [pool-2-thread-1] io.debezium.embedded.EmbeddedEngine : Error while trying to run connector class 'io.debezium.connector.postgresql.PostgresConnector'
Caused by: org.postgresql.util.PSQLException: ERROR: replication slot "debezium" is active for PID <>
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2440) ~[postgresql-42.2.5.jar:42.2.5]
at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1116) ~[postgresql-42.2.5.jar:42.2.5]
at org.postgresql.core.v3.QueryExecutorImpl.startCopy(QueryExecutorImpl.java:842) ~[postgresql-42.2.5.jar:42.2.5]
at org.postgresql.core.v3.replication.V3ReplicationProtocol.initializeReplication(V3ReplicationProtocol.java:58) ~[postgresql-42.2.5.jar:42.2.5]
at org.postgresql.core.v3.replication.V3ReplicationProtocol.startLogical(V3ReplicationProtocol.java:42) ~[postgresql-42.2.5.jar:42.2.5]
at org.postgresql.replication.fluent.ReplicationStreamBuilder$1.start(ReplicationStreamBuilder.java:38) ~[postgresql-42.2.5.jar:42.2.5]
at org.postgresql.replication.fluent.logical.LogicalStreamBuilder.start(LogicalStreamBuilder.java:37) ~[postgresql-42.2.5.jar:42.2.5]
If your application needs at-least-once delivery guarantees of all messages, it would be better to use the the full distributed Debezium system with Kafka-Connect.
Further Reading
Opinions expressed by DZone contributors are their own.
Comments