Change Data Capture for Real-Time BI and Data Warehousing

Real-time is the right time — everything you need for data capture for BI and data warehousing.



As the need to store and aggregate data has grown, along with the desire to draw intelligence from that data, so too has the expectation of accessing it at any time and from anywhere. Data warehouses have done their job of aggregating data into a single source of business intelligence, but we’re still refining our ability to meet the demand for constant, uninterrupted access without scheduling downtime for maintenance, backups, and the like.

This is where replication technology such as Change Data Capture (CDC) plays a role — performing tasks in the background and making sure we can always access the data that powers real-time intelligence.


We can think of a data warehouse as a stockroom: a large area where data consolidated from different sources can be integrated, archived, and stored for analysis. Data warehouses have been around for decades. Teradata, the grandfather of the data warehouse, built its database on a design principle where everything is parallelized, with no single bottleneck to limit performance and scalability.

Early data warehouses of the 1980s and ’90s were expensive to deploy and maintain. However, with the right focus and implementation, the value obtained from the data warehouse proved to be tremendous, and case studies documented the successes that companies such as Walmart and AT&T achieved with their data warehouse efforts.

Data warehouses continue to be very successful analytical solutions for organizations looking to optimize core business processes, save costs, and minimize risks. They’ve traditionally focused on consolidating data from a variety of transactional systems. Data from packaged Enterprise Resource Planning (ERP) applications, Supply Chain Management (SCM) solutions, and Customer Relationship Management (CRM) software feeds the data warehouse, as does data from many other industry-specific and home-grown sources.

With the proliferation of cloud-based technology, there are even more data sources and targets to account for in a business’s BI strategy. Integrating data into the data warehouse therefore remains an essential consideration for data warehouse initiatives.

Until quite recently, ETL jobs ran once a day to populate analytical systems. This once-a-day approach worked well because systems typically had a period during the day (or night) when the system was not very active, allowing data extract jobs to run without impacting the performance of the source transactional systems.

However, in our global, connected world, systems are active 24/7, which leaves no quiet window in which heavy data extraction jobs can run without affecting the source systems. Further, organizations see value in quicker access to analytical data, for example to gain a competitive edge or to reduce fraud. Real-time data is essential to modern business.

Fueling a Real-Time Data Warehouse With Log-Based Change Data Capture (CDC)

Data warehouses consolidate data for a single source of BI

With a real-time data warehouse, companies can make decisions more quickly, based on more current, more accurate, and transactionally consistent data. This is where heterogeneous data replication technology like log-based Change Data Capture (CDC) is useful. As its name implies, CDC identifies incremental changes and then either synchronizes them with another system or stores them as an audit trail.
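
As a rough illustration of what synchronizing incremental changes can look like, the Python sketch below applies a short stream of change events to a target copy and records each one in an audit trail. The event shape (`op`, `key`, `row`) and the function name `apply_changes` are assumptions made for this example, not the format of any particular CDC product.

```python
from typing import Any, Dict, List, Optional, TypedDict


class ChangeEvent(TypedDict):
    op: str                        # "insert", "update", or "delete"
    key: int                       # primary key of the affected row
    row: Optional[Dict[str, Any]]  # new row image; None for deletes


def apply_changes(target: Dict[int, Dict[str, Any]],
                  events: List[ChangeEvent],
                  audit: List[ChangeEvent]) -> None:
    """Apply each change to the target in order and log it in the audit trail."""
    for event in events:
        if event["op"] in ("insert", "update"):
            target[event["key"]] = dict(event["row"])
        elif event["op"] == "delete":
            target.pop(event["key"], None)
        else:
            raise ValueError(f"unknown operation: {event['op']}")
        audit.append(event)


# Hypothetical sample data: only the changes travel, never the full table.
warehouse: Dict[int, Dict[str, Any]] = {}
audit_trail: List[ChangeEvent] = []
apply_changes(warehouse, [
    {"op": "insert", "key": 1, "row": {"customer": "Acme", "total": 100}},
    {"op": "update", "key": 1, "row": {"customer": "Acme", "total": 120}},
    {"op": "insert", "key": 2, "row": {"customer": "Globex", "total": 75}},
    {"op": "delete", "key": 2, "row": None},
], audit_trail)

print(warehouse)
print(len(audit_trail))
```

The target ends up holding only the latest state of row 1, while the audit trail keeps all four events in the order they were applied.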

CDC comes in multiple flavors, including trigger-based and log-based. Log-based CDC builds on something transactional databases already maintain: a transaction log that records every change so that the committed state of the database can be recovered after a crash.

Log-based CDC requires no additional table updates or query processing. It reads changes directly from the logs, outside the transaction itself, and therefore has less impact on the database. In contrast, trigger-based CDC creates triggers on every table that requires change data capture, and firing those triggers adds work to each transaction and slows it down.
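
To make the trigger overhead concrete, here is a minimal sketch of trigger-based CDC using Python's built-in `sqlite3` module. The `orders` and `orders_changes` tables and the trigger names are hypothetical; the point is only that every application write causes a second write inside the same transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);

    -- Hypothetical change table that the triggers populate.
    CREATE TABLE orders_changes (
        change_id INTEGER PRIMARY KEY AUTOINCREMENT,
        op        TEXT,
        order_id  INTEGER,
        status    TEXT
    );

    CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
        INSERT INTO orders_changes (op, order_id, status)
        VALUES ('I', NEW.id, NEW.status);
    END;

    CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
        INSERT INTO orders_changes (op, order_id, status)
        VALUES ('U', NEW.id, NEW.status);
    END;

    CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
        INSERT INTO orders_changes (op, order_id, status)
        VALUES ('D', OLD.id, OLD.status);
    END;
""")

# One transaction: three application writes, plus three trigger writes.
with conn:
    conn.execute("INSERT INTO orders (id, status) VALUES (1, 'new')")
    conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
    conn.execute("DELETE FROM orders WHERE id = 1")

changes = conn.execute(
    "SELECT op, order_id, status FROM orders_changes ORDER BY change_id"
).fetchall()
print(changes)
```

The change table captures all three operations, but only because the source database did twice the writing. A log-based reader would obtain the same information from the log the database writes anyway.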

Because log-based CDC has minimal impact on the transaction processing applications, it suits a wide range of scenarios, including systems with extremely high transaction volumes. With ongoing real-time data replication using log-based CDC, there is no need for a regular bulk load between the source database and the operational data store (ODS). Data moves more quickly and with less pressure on resources. Changes can be processed much closer to real time, with data latency measured in seconds or, in some cases, less than a second.
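
The idea of continuous replication without bulk loads can be sketched as a reader that remembers its position in the log and, on each poll, forwards only the changes of transactions that have committed since the last poll. The in-memory log below is a simplified stand-in: real transaction logs are binary and vendor-specific, and the names `LogRecord` and `LogReader` are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LogRecord:
    txn: int           # transaction that wrote the record
    kind: str          # "change", "commit", or "rollback"
    payload: str = ""  # description of the change; empty for commit/rollback


class LogReader:
    """Reads a growing log incrementally and emits committed changes only."""

    def __init__(self) -> None:
        self.position = 0                        # index of the next unread record
        self.pending: Dict[int, List[str]] = {}  # changes of still-open transactions

    def poll(self, log: List[LogRecord]) -> List[str]:
        committed: List[str] = []
        for record in log[self.position:]:
            if record.kind == "change":
                self.pending.setdefault(record.txn, []).append(record.payload)
            elif record.kind == "commit":
                committed.extend(self.pending.pop(record.txn, []))
            elif record.kind == "rollback":
                self.pending.pop(record.txn, None)  # discard uncommitted work
        self.position = len(log)
        return committed


log: List[LogRecord] = [
    LogRecord(1, "change", "orders: insert id=1"),
    LogRecord(2, "change", "orders: insert id=2"),
    LogRecord(1, "commit"),
]
reader = LogReader()
first = reader.poll(log)   # transaction 1 has committed; transaction 2 is still open

log += [
    LogRecord(2, "change", "orders: update id=2"),
    LogRecord(3, "change", "orders: insert id=3"),
    LogRecord(3, "rollback"),
    LogRecord(2, "commit"),
]
second = reader.poll(log)  # reads only the four new records

print(first)
print(second)
```

Each poll touches only the records appended since the previous one, which is why latency can stay low without re-extracting whole tables. Buffering changes until the commit record arrives is what keeps the target transactionally consistent in this sketch.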

Companies must arm their BI teams with a constant stream of real-time data to make the tactical day-to-day decisions needed to stay competitive. Powering a real-time data warehouse with log-based CDC accomplishes that goal and helps organizations realize the full potential of their business intelligence solutions.

