DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Point-In-Time Recovery (PITR) in PostgreSQL
  • Generate Random Test Data in PostgreSQL
  • The Dual Write Problem: What Looks Safe in Code but Breaks in Production
  • Production Database Migration or Modernization: A Comprehensive Planning Guide [Part 1]

Trending

  • Fact-Checking LLM Outputs Programmatically: Building a Verification Layer That Catches Hallucinations
  • What Is Plagiarism? How to Avoid It and Cite Sources
  • PostgreSQL Everywhere and for Everything
  • Designing a Secure API From Day One
  1. DZone
  2. Data Engineering
  3. Databases
  4. Why PostgreSQL CDC Breaks in Production

Why PostgreSQL CDC Breaks in Production

PostgreSQL CDC often fails after WAL reading: snapshot handoff gaps, unsafe checkpoints, bad ordering, and retry logic can silently corrupt replicated data.

By 
Dmitry Narizhnykh user avatar
Dmitry Narizhnykh
DZone Core CORE ·
May. 06, 26 · Analysis
Likes (0)
Comment
Save
Tweet
Share
1.4K Views

Join the DZone community and get the full member experience.

Join For Free

Keeping two PostgreSQL databases in sync sounds simple. Until it isn’t.

At first, everything looks fine:

  • Logical replication is enabled
  • Changes are flowing
  • The target database looks current

Then, a few days later, something is off:

  • Rows are missing
  • Some updates appear twice
  • Replication lag jumps for no obvious reason
  • A small schema change breaks the pipeline
  • Restarting the job does not clearly continue from the right place

Now the problem is no longer “how do I stream changes from PostgreSQL?” The problem is proving that the target database is still correct.

That is where most PostgreSQL CDC guides stop being useful. They explain how to enable logical replication. They explain replication slots, publications, WAL, maybe Debezium.

That is useful.

But production CDC usually breaks somewhere else: in the handoff between initial load and CDC, in retries, in checkpoints, in ordering, and in the recovery paths nobody tests until something fails.

The Promise of PostgreSQL CDC

Change data capture exists for a good reason.

Instead of repeatedly querying whole tables or running batch exports, PostgreSQL can stream committed changes from its write-ahead log.

In theory, that gives you:

  • Near real-time replication
  • Less load on the source database
  • No polling loops
  • No full reload after every change

And yes, that part works.

WAL is reliable. Logical replication is mature. PostgreSQL can tell you what changed and in what order. The hard part starts after that. Because WAL is only one part of the system.

Once changes leave PostgreSQL, they usually pass through readers, queues, workers, retry logic, target writes, checkpoints, and monitoring. That is where things get interesting. Also annoying.

Where PostgreSQL CDC Actually Breaks

A simplified CDC pipeline often looks like this:

Plain Text
 
PostgreSQL WAL
      |
      v
 CDC reader
      |
      v
 Queue / buffer
      |
      v
 Workers
   |      |
   v      v
Target   Retry / failure handling


PostgreSQL WAL is ordered. Your pipeline may not be.

The moment you add queues, parallel workers, retries, and target writes, correctness becomes your responsibility. Not just throughput. Not just “events per second.” Correctness.

That means answering questions like:

  • Did the target receive every committed change?
  • Were updates applied in the right order?
  • Did a retry apply the same change twice?
  • Was the checkpoint saved before or after the target write?
  • What happens if the job stops halfway through a large initial load?
  • What happens if CDC starts after the snapshot, but not from the snapshot boundary?

Those are the questions that decide whether CDC is reliable in production.

1. Initial Load Is Not CDC

CDC starts from a point in time. It does not recreate everything that already existed before that point.

So if the target is empty, you first need a baseline copy of the existing data. Usually, this is called an initial load or snapshot.

The hard part is what happens next.

CDC snapshot gap
CDC snapshot gap


If CDC starts from the wrong position, the target may miss rows, replay rows, or apply stale updates.

This is the snapshot gap. It appears when initial load and CDC are run as separate steps without a shared WAL boundary.

snapshot + CDC is two separate operations.
snapshot → CDC is a controlled handoff.
This is the difference between “snapshot + CDC” and a continuous snapshot → CDC flow. For PostgreSQL, the safe version means:
  1. Read a consistent snapshot
  2. Record the matching WAL position
  3. Start CDC from that position

Anything else is guessing.

2. WAL Is Ordered, But Workers Can Break Ordering

PostgreSQL emits changes in commit order. That does not mean your target receives them in a safe order.

Once a pipeline introduces parallelism, ordering can break. For example:

Plain Text
 
Event 1: update customer id = 42
Event 2: delete customer id = 42


If these events are processed by different workers, the delete might reach the target first.

Or this:

Plain Text
 
Event 1: insert parent row
Event 2: insert child row


If the child insert arrives first, the target may reject it because the parent row does not exist yet. The usual answer is controlled parallelism:

  • Preserve ordering per table where needed
  • Avoid unsafe reordering inside the same key/table stream
  • Batch carefully
  • Retry without changing event order

Parallelism is not bad. Blind parallelism is bad.

CDC systems usually fail here when they optimize for speed before defining ordering guarantees.

3. At-Least-Once Delivery Means Duplicates Are Normal

Most CDC pipelines are at least once. That means an event may be delivered more than once. This is not automatically a bug.

It is a normal recovery behavior. If the pipeline writes to the target, then crashes before saving its checkpoint, it may replay the same event after restart. That is why target writes must be idempotent.

For database-to-database replication, this usually means:

  • Inserts should behave like upserts when possible
  • Updates should be safe to apply more than once
  • Deletes should not fail the whole stream if the row is already gone
  • Primary keys matter

If the target write logic is not idempotent, retries can silently corrupt data. For example:

  • Duplicate inserts
  • Counters incremented twice
  • Audit rows repeated
  • Append-only targets growing incorrect history

CDC without idempotent target writes is fragile.

It may work in a demo.
Production will eventually find the retry path. Production has a talent for that.

4. Checkpoints Must Be Commit-Aware

Checkpointing sounds simple:

Save the last processed WAL position.

But the timing matters.

If the checkpoint is saved too early, the failure path looks like this:

  1. CDC reads an event at LSN X.
  2. Checkpoint advances to X.
  3. Target write fails.
  4. Process crashes before the failure is handled.
  5. Restart begins after X, so the event is skipped.

The system now believes the event was delivered. But it was never written to the target.

That is silent data loss. The safe order is:

  1. Read event
  2. Write to target
  3. Wait for target ACK
  4. Advance checkpoint

This way, if the process crashes after reading but before committing to the target, the event can be replayed. That may create duplicates if target writes are not idempotent, but it avoids skipping committed source changes.

This is the usual tradeoff:

  • Checkpoint too early → data loss
  • Checkpoint after commit → possible replay
  • Replay + idempotency → safe recovery

A reliable CDC system must choose the boring, safe option.

5. Schema Changes Are Not Free

CDC captures data changes. It does not magically solve schema evolution. In real systems, someone eventually:

  • Adds a column
  • Changes a type
  • Renames a table
  • Drops a column
  • Changes a default
  • Modifies a constraint

Then the CDC pipeline has to answer:

  • Does the target schema already have this column?
  • Can this type be mapped safely?
  • Should the pipeline stop or continue?
  • What happens to old events?
  • What happens to in-flight batches?

Some platforms try to automate schema evolution. That can be useful, especially in analytics pipelines.

But for database-to-database replication, automatic schema changes can also be dangerous. A production target is not always just a passive copy. It may have constraints, indexes, permissions, triggers, or application dependencies.

The safest practical answer is usually:

  • Detect schema mismatch clearly
  • Fail loudly when target writes are unsafe
  • Let operators coordinate schema changes
  • Do not silently invent a broken target schema

A CDC pipeline that keeps running incorrectly is worse than one that stops. At least a stopped pipeline is honest.

6. Long Transactions Create Hidden Lag

CDC lag is not always caused by slow networking or slow consumers. Sometimes the source transaction itself is the problem. PostgreSQL changes become safe to replicate only after the transaction commits.

So a large transaction can look quiet for a while, then suddenly release a huge batch of changes at once.

PLSQL
 
BEGIN;
UPDATE orders
SET status = 'archived'
WHERE created_at < '2024-01-01';

– this runs for several minutes
– CDC cannot treat these row changes as final yet

COMMIT;


While the transaction is open, downstream replication may appear to be stuck or falling behind.

After COMMIT, all those changes become visible together. Result:

  • Replication lag jumps
  • Target writes arrive in a burst
  • Workers suddenly have a backlog
  • Monitoring graphs look haunted
Long transaction
Long transaction


This is common during bulk updates, maintenance jobs, large imports, or application code that keeps transactions open too long.

CDC cannot remove this behavior. It can only process the changes once PostgreSQL makes them committed and visible.

7. Restarts Are Where Fake Reliability Gets Exposed

A CDC pipeline that works while everything is healthy is not enough.

The real test is what happens after:

  • Service restart
  • Database disconnect
  • Target write failure
  • Process crash
  • Machine reboot
  • Operator pressing "Stop"

Restart behavior must be explicit. The system should know:

  • The last durable source position
  • Whether the target write was acknowledged
  • Whether the initial load had completed
  • Whether CDC handoff had happened
  • Whether a partially loaded table can continue safely

If those states are not stored durably, restart becomes guesswork. And guesswork is not a recovery strategy. Treat CDC as a workflow, not just a stream.

Most real database movement work does not start with CDC. It starts with questions:

  • Which tables should be copied?
  • How large are they?
  • Does the target schema match?
  • Can the existing data be loaded safely?
  • Where is the handoff point between the initial load and CDC?
  • How do we validate the initial load?
  • How do we keep validating after the CDC starts?
  • How do we recover if something stops?

But many setups split those steps across tools:

Plain Text
 
SQL client → export → scripts → pipeline → CDC → validation


Each tool may be fine on its own. The problems live in the gaps:

  • Assumptions are lost
  • State is not shared
  • Validation becomes manual
  • Handoff points are unclear
  • Restart behavior is inconsistent

That is why CDC should be treated as part of the full data movement workflow:

Plain Text
 
explore → load → validate → replicate → keep validating


Not because workflows look nice on a diagram, but because the correctness problems happen between those steps.

What a Reliable CDC System Needs

A production CDC system should handle failure and recovery paths deliberately:

  • Pure CDC without initial load: target writes must be idempotent, and checkpoints must be durable.
  • Initial load → CDC: CDC must start from the snapshot boundary.
  • Restart after stop: checkpoints must advance only after successful target writes.
  • Interrupted large load: the system must know what was already copied.
  • Delayed CDC after snapshot: the system must not start blindly from “now.”
  • Schema mismatch: the system should fail clearly, not silently corrupt data.

These scenarios look boring. They are also where many CDC implementations fail.

Not because PostgreSQL is unreliable. Because the workflow around PostgreSQL CDC is incomplete.

When Debezium + Kafka is the right answer

A common production CDC architecture looks like this:

Plain Text
 
PostgreSQL → Debezium → Kafka → consumer → target database


This can be the right architecture. Especially when you need:

  • Kafka as the central event backbone
  • Multiple independent consumers
  • Event-driven services
  • Very high throughput
  • Existing Kafka operations

Debezium is a serious CDC tool. Kafka is not the villain. The problem starts when the architecture is much bigger than the job.

If the goal is simply PostgreSQL → PostgreSQL replication, the stack becomes a distributed system around a relatively direct task.

Now every issue has several possible owners:

  • PostgreSQL WAL
  • Replication slot state
  • Debezium connector config
  • Kafka topic lag
  • Consumer retry logic
  • Rarget database writes
  • Schema handling between all of the above

When lag appears, where is it? When a row is missing, who skipped it? When an event replays, was it Debezium, Kafka, the consumer, or the checkpoint? When the target schema changes, which layer owns the fix?

Nothing here is impossible. But it changes the problem. You are no longer just moving data. You are operating a multi-component CDC platform.

That may be worth it. But it should be a conscious tradeoff, not the default answer for every sync job. Kafka did not break. The architecture became heavier than the job required.

Where DBConvert Streams Fits

DBConvert Streams keeps the risky parts of PostgreSQL CDC in one workflow:

Plain Text
 
load → handoff → replicate → resume → validate


That does not remove the hard parts of CDC. It makes them explicit.

Instead of stitching together a snapshot job, a CDC process, retry logic, checkpoints, and validation queries by hand, the workflow is visible in one place.

What Changed in DBConvert Streams 2.1

DBConvert Streams 2.1 focuses on several of these recovery paths:

  • Initial load → CDC now hands off automatically from a saved position.
  • CDC resumes from the last durable checkpoint after Stop or restart.
  • Eligible large load runs can continue from saved progress instead of starting again from zero.
  • Schema changes are still not handled automatically and need coordination.

These are workflow changes, not new WAL magic. That is the point.

What DBConvert Streams Does Not Solve Automatically

DBConvert Streams 2.1 does not automatically handle:

  • Schema evolution
  • Exactly-once delivery across source, pipeline, and target
  • Lag caused by long PostgreSQL transactions
  • Target repair after manual changes or divergence

These are still operational boundaries.

When CDC Is the Wrong Solution

CDC is not always the answer.

Use something simpler if:

  • Data changes rarely
  • Latency does not matter
  • A nightly reload is acceptable
  • The target can be rebuilt cheaply
  • Correctness matters more than freshness
Batch jobs are boring.
But boring is not an insult.
Boring systems often fail in predictable ways.


A full reload that takes 10 minutes and is easy to verify may be better than a CDC pipeline nobody fully understands. CDC is worth it when freshness matters, and the source cannot be repeatedly reloaded.

Otherwise, do not add moving parts just to feel enterprise. PostgreSQL will not be impressed. The important part is that they are explicit, not hidden behind a “CDC just works” promise.

Final Thought

PostgreSQL CDC is not hard because WAL is unreliable.

It is hard because a real CDC system has state:

  • Snapshot state
  • WAL position
  • Checkpoint state
  • Target commit state
  • Retry state
  • Schema state

If that state is implicit, CDC breaks in strange ways. If it is explicit, CDC becomes boring. And boring is exactly what production replication should be.

DBConvert Streams 2.1 handles this as one controlled workflow: initial load, CDC handoff, checkpointing, resume, and monitoring.

See: Log-based CDC for MySQL and PostgreSQL

Database Production (computer science) PostgreSQL write-ahead logging

Published at DZone with permission of Dmitry Narizhnykh. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Point-In-Time Recovery (PITR) in PostgreSQL
  • Generate Random Test Data in PostgreSQL
  • The Dual Write Problem: What Looks Safe in Code but Breaks in Production
  • Production Database Migration or Modernization: A Comprehensive Planning Guide [Part 1]

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook