
Change Data Capture (CDC) With Embedded Debezium and Spring Boot

By Sohan Ganapathy · Dec. 03, 19 · Tutorial


While working with data or replicating data sources, you have probably heard the term Change Data Capture (CDC). As the name suggests, “CDC” is a design pattern that continuously identifies and captures incremental changes to data. This pattern is used for real-time data replication from live databases to analytical data sources or read replicas. It can also be used to trigger events based on data changes, such as in the Outbox pattern.

Most modern databases support CDC through transaction logs. A transaction log is a sequential record of all changes made to the database, while the actual data is contained in a separate file.

In this blog, I want to focus on a framework commonly used for CDC and on embedding it within a Spring Boot application.

You may also like: Kafka Connectors Without Kafka.

What Is Debezium?

Debezium is a distributed platform built for CDC. It uses database transaction logs and creates event streams on row-level changes. Applications listening to these events can perform needed actions based on incremental data changes.

Debezium provides a library of connectors, supporting a variety of databases available today. These connectors can monitor and record row-level changes in the schemas of a database. They then publish the changes onto a streaming service like Kafka.
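
For example, an insert of the student record used later in this post would surface as a change event whose payload looks roughly like this (a trimmed sketch; the source block and timestamp fields are elided):

{
  "before": null,
  "after": {
    "id": 1,
    "name": "Jack",
    "address": "Dallas, TX",
    "email": "jack@gmail.com"
  },
  "source": { ... },
  "op": "c"
}

The op field is "c" for creates, "u" for updates, and "d" for deletes; for deletes, the before block carries the prior row state.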

Normally, one or more connectors are deployed into a Kafka Connect cluster and are configured to monitor databases and publish data-change events to Kafka. A distributed Kafka Connect cluster provides the fault tolerance and scalability needed, ensuring that all the configured connectors are always running.

What Is Embedded Debezium?

Applications that don’t need the level of fault tolerance and reliability Kafka Connect offers, or that want to minimize the cost of running the entire platform, can run Debezium connectors within the application instead. This is done by embedding the Debezium engine and configuring the connector to run within the application. When data change events occur, the connector sends them directly to the application.

Running Debezium With Spring Boot

Keeping the example simple, let’s have a Spring Boot application, “Student CDC Relay,” run embedded Debezium and tail the transaction logs of a Postgres database that houses the “Student” table. The Debezium connector configured within the Spring Boot application invokes a method in the application whenever a database operation such as an insert, update, or delete is performed on the “Student” table. The method acts on these events and syncs the data to the Student index in Elasticsearch.

[Image: design of the example being showcased]
The code for the sample can be found here.

Installing Tools

All the needed tools can be installed by running the docker-compose file below. This starts the Postgres database on port 5432 and Elasticsearch on ports 9200 (HTTP) and 9300 (transport).


version: "3.5"

services:
  # Install postgres and setup the student database.
  postgres:
    container_name: postgres
    image: debezium/postgres
    ports:
      - 5432:5432
    environment:
      - POSTGRES_DB=studentdb
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=password

  # Install Elasticsearch.
  elasticsearch:
    container_name: elasticsearch
    image: docker.elastic.co/elasticsearch/elasticsearch:6.8.0
    environment:
      - discovery.type=single-node
    ports:
      - 9200:9200
      - 9300:9300


We use the image debezium/postgres because it comes prebuilt with the logical decoding feature. Logical decoding is the mechanism that allows changes committed to the transaction log to be extracted, making CDC possible. The documentation for installing the plugin in Postgres can be found here.
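
On a stock Postgres image, you would have to enable logical decoding yourself (wal_level = logical in postgresql.conf, plus a logical decoding plugin such as wal2json). Either way, the setting can be verified from psql:

SHOW wal_level;
-- Should return 'logical' for Debezium to be able to stream changes.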

Understanding the Code

The first step is to define the Maven dependencies in the pom.xml for debezium-embedded and a Debezium connector. The sample reads changes from Postgres, so we use the Postgres connector.

<dependency>
    <groupId>io.debezium</groupId>
    <artifactId>debezium-embedded</artifactId>
    <version>${debezium.version}</version>
</dependency>
<dependency>
    <groupId>io.debezium</groupId>
    <artifactId>debezium-connector-postgres</artifactId>
    <version>${debezium.version}</version>
</dependency>


Then, we configure the connector that listens for changes on the Student table. For the connector.class setting, we use PostgresConnector, the Debezium-provided Java class that tails the source database.

The connector also takes an important setting, offset.storage, which helps the application keep track of how much of the transaction log it has processed. Should the application fail while processing, it can resume reading the changes from the point of failure after a restart.

There are multiple ways of storing offsets; in this example, we use the class FileOffsetBackingStore to store offsets in a local file defined by offset.storage.file.filename. The connector records its position in this file, and the Debezium engine flushes offsets to it periodically, based on the offset.flush.interval.ms setting.

The remaining connector parameters are the connection properties of the Postgres database that houses the Student table.


@Bean
public io.debezium.config.Configuration studentConnector() {
    return io.debezium.config.Configuration.create()
            .with("connector.class", "io.debezium.connector.postgresql.PostgresConnector")
            .with("offset.storage",  "org.apache.kafka.connect.storage.FileOffsetBackingStore")
            .with("offset.storage.file.filename", "/path/cdc/offset/student-offset.dat")
            .with("offset.flush.interval.ms", 60000)
            .with("name", "student-postgres-connector")
            .with("database.server.name", studentDBHost+"-"+studentDBName)
            .with("database.hostname", studentDBHost)
            .with("database.port", studentDBPort)
            .with("database.user", studentDBUserName)
            .with("database.password", studentDBPassword)
            .with("database.dbname", studentDBName)
            .with("table.whitelist", STUDENT_TABLE_NAME).build();
}
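
For completeness, the studentDBHost-style fields referenced in the bean above can be injected from application properties. A minimal sketch, assuming illustrative property keys (the sample's actual keys may differ):

import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Configuration;

@Configuration
public class StudentConnectorConfig {

    // The property keys below are illustrative, not the sample's actual keys.
    @Value("${student.database.hostname}")
    private String studentDBHost;

    @Value("${student.database.port}")
    private String studentDBPort;

    @Value("${student.database.user}")
    private String studentDBUserName;

    @Value("${student.database.password}")
    private String studentDBPassword;

    @Value("${student.database.name}")
    private String studentDBName;

    // Fully qualified table name, as table.whitelist expects.
    private static final String STUDENT_TABLE_NAME = "public.student";

    // The studentConnector() bean shown above would live in this class.
}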


The final step in setting up embedded Debezium is to start it when the application starts up. For this, we use the class EmbeddedEngine, which acts as a wrapper around the connector and manages the connector's lifecycle. The engine is created from the connector configuration and a function that it will call for every data change event; in our example, that is the method handleEvent().


private CDCListener(Configuration studentConnector, StudentService studentService) {
    this.engine = EmbeddedEngine
            .create()
            .using(studentConnector)
            .notifying(this::handleEvent).build();

    this.studentService = studentService;
}


In handleEvent(), we parse every event, identify which operation took place, and invoke the StudentService to perform the corresponding create, update, or delete on Elasticsearch, using Spring Data Elasticsearch.
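
The body of handleEvent() is not shown above; here is a minimal sketch of how such a handler can unpack a Debezium SourceRecord. The maintainReadModel() method on StudentService is a hypothetical name for the code that writes to, or removes from, the index:

import java.util.Map;
import java.util.stream.Collectors;

import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.source.SourceRecord;

import io.debezium.data.Envelope;
import io.debezium.data.Envelope.Operation;

private void handleEvent(SourceRecord sourceRecord) {
    Struct sourceRecordValue = (Struct) sourceRecord.value();
    if (sourceRecordValue == null) {
        return; // Deletes are followed by a tombstone record with no payload.
    }

    Operation operation = Operation.forCode(
            (String) sourceRecordValue.get(Envelope.FieldName.OPERATION));

    // For deletes, the row state is in "before"; for creates/updates, in "after".
    String stateField = operation == Operation.DELETE
            ? Envelope.FieldName.BEFORE
            : Envelope.FieldName.AFTER;
    Struct state = (Struct) sourceRecordValue.get(stateField);

    // Flatten the struct into a map of column name -> value.
    Map<String, Object> studentData = state.schema().fields().stream()
            .map(Field::name)
            .filter(fieldName -> state.get(fieldName) != null)
            .collect(Collectors.toMap(fieldName -> fieldName, state::get));

    studentService.maintainReadModel(studentData, operation); // hypothetical method
}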

Now that we have set up the EmbeddedEngine, we can start it asynchronously using an ExecutorService. The engine implements Runnable, so it can be handed straight to the executor.


private final Executor executor = Executors.newSingleThreadExecutor();

@PostConstruct
private void start() {
    this.executor.execute(engine);
}

@PreDestroy
private void stop() {
    if (this.engine != null) {
        this.engine.stop();
    }
}


Seeing the Code in Action

We then start all the required tools by running the docker-compose file with the command docker-compose up -d, and start the ‘Student CDC Relay’ with the command mvn spring-boot:run. We can set up the Student table by running the following script:


CREATE TABLE public.student
(
    id integer NOT NULL,
    address character varying(255),
    email character varying(255),
    name character varying(255),
    CONSTRAINT student_pkey PRIMARY KEY (id)
);
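
For reference, the matching Elasticsearch document can be modeled along the lines below, using Spring Data Elasticsearch 3.x annotations. This is a sketch, not necessarily the sample's exact class; the index and type names match the curl output later in this post:

import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;

@Document(indexName = "student", type = "student")
public class Student {

    @Id
    private Integer id;
    private String name;
    private String address;
    private String email;

    // Getters and setters omitted for brevity.
}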


To see the code in action, we make data changes on the table we just created.

Inserting a Record Into the Student Table

We can run the following SQL statement to insert a record into the Student table in our Postgres database.


INSERT INTO STUDENT(ID, NAME, ADDRESS, EMAIL) VALUES('1','Jack','Dallas, TX','jack@gmail.com');


We can verify that a record was created in Elasticsearch.


$ curl -X GET http://localhost:9200/student/student/1?pretty=true
{
  "_index" : "student",
  "_type" : "student",
  "_id" : "1",
  "_version" : 31,
  "_seq_no" : 30,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "id" : 1,
    "name" : "Jack",
    "address" : "Dallas, TX",
    "email" : "jack@gmail.com"
  }
}


Updating a Record in the Student Table

We can run the following SQL statement to update a record on the Student table in our Postgres database.


UPDATE STUDENT SET EMAIL='jill@gmail.com', NAME='Jill' WHERE ID = 1; 


We can verify that the data has been changed to ‘Jill’ in Elasticsearch.


$ curl -X GET http://localhost:9200/student/student/1?pretty=true
{
  "_index" : "student",
  "_type" : "student",
  "_id" : "1",
  "_version" : 32,
  "_seq_no" : 31,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "id" : 1,
    "name" : "Jill",
    "address" : "Dallas, TX",
    "email" : "jill@gmail.com"
  }
}


Deleting a Record in the Student Table

We can run the following SQL statement to delete a record from the Student table in our Postgres database.


DELETE FROM STUDENT WHERE ID = 1;


We can verify in Elasticsearch that the data has been deleted; only the id remains, with all other fields null.


$ curl -X GET http://localhost:9200/student/student/1?pretty=true
{
  "_index" : "student",
  "_type" : "student",
  "_id" : "1",
  "_version" : 33,
  "_seq_no" : 32,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "id" : 1,
    "name" : null,
    "address" : null,
    "email" : null
  }
}


Final Thoughts

This approach is indeed far simpler, with fewer moving parts, but it is more limited in terms of scaling and far less tolerant of failures.

Source records are handled exactly once while the CDC-Relay application is running normally. Downstream applications do, however, need to be tolerant of receiving duplicate events after a restart of the CDC-Relay application; in this sample, writes to Elasticsearch are keyed by the student id, so a replayed event simply overwrites the same document.

We can test the limitations around scaling by starting another instance of the ‘Student CDC Relay’ [on another port]. We see the exception below, because a Postgres replication slot can only be consumed by one active connection at a time:

ERROR 59453 --- [pool-2-thread-1] io.debezium.embedded.EmbeddedEngine      : Error while trying to run connector class 'io.debezium.connector.postgresql.PostgresConnector'

Caused by: org.postgresql.util.PSQLException: ERROR: replication slot "debezium" is active for PID <>
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2440) ~[postgresql-42.2.5.jar:42.2.5]
at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1116) ~[postgresql-42.2.5.jar:42.2.5]
at org.postgresql.core.v3.QueryExecutorImpl.startCopy(QueryExecutorImpl.java:842) ~[postgresql-42.2.5.jar:42.2.5]
at org.postgresql.core.v3.replication.V3ReplicationProtocol.initializeReplication(V3ReplicationProtocol.java:58) ~[postgresql-42.2.5.jar:42.2.5]
at org.postgresql.core.v3.replication.V3ReplicationProtocol.startLogical(V3ReplicationProtocol.java:42) ~[postgresql-42.2.5.jar:42.2.5]
at org.postgresql.replication.fluent.ReplicationStreamBuilder$1.start(ReplicationStreamBuilder.java:38) ~[postgresql-42.2.5.jar:42.2.5]
at org.postgresql.replication.fluent.logical.LogicalStreamBuilder.start(LogicalStreamBuilder.java:37) ~[postgresql-42.2.5.jar:42.2.5]


If your application needs at-least-once delivery guarantees for all messages, it would be better to use the full distributed Debezium system with Kafka Connect.


Further Reading

  • Building a Custom Kafka Connect Connector.
  • Connecting Apache Kafka With Mule ESB.
  • Spring Boot and Elasticsearch Tutorial.