Real-Time Data Processing Pipeline With MongoDB, Kafka, Debezium and RisingWave

This article covers the powerful combination of MongoDB, Kafka, Debezium, and RisingWave for analyzing real-time data, and how these tools work together.

By Bobur Umurzokov · Aug. 08, 23 · Tutorial

Today, the demand for real-time data processing and analytics is higher than ever. The modern data ecosystem requires tools and technologies that can not only capture, store, and process vast amounts of data but also deliver insights in real time. This article covers the powerful combination of MongoDB, Kafka, Debezium, and RisingWave: how these open-source technologies work together to analyze real-time data, and the benefits of using them.

Understanding Debezium and RisingWave

Before we dive into the implementation details, it’s important to understand what these two tools are and what they do.

  1. Debezium is an open-source distributed platform for change data capture (CDC). CDC is a technique for tracking data changes written to a source database and automatically syncing them to target systems. For example, Debezium’s MongoDB connector can monitor databases and collections for document changes as they occur, recording those changes as events in Kafka topics.
  2. RisingWave is a distributed open-source SQL database for stream processing. Its main goal is to make it easier and cheaper to build applications that operate in real time. As streaming data arrives, RisingWave performs on-the-fly computations on each new piece of data and promptly updates the results. For example, RisingWave can ingest data from sources like Kafka and maintain materialized views over it, which you can query using SQL.

Analyzing Real-Time Data: The Pipeline

Now that we know what each tool does, let’s discuss how MongoDB, Kafka, Debezium, and RisingWave work together to create an efficient real-time data analysis pipeline. These technologies are free to use and easy to integrate with each other.

  1. Data Generation and Storage in MongoDB: Our data pipeline starts with the generation and storage of data in MongoDB. Given MongoDB’s flexible document model, data can be stored in multiple formats, making it suitable for diverse data sources.
  2. Data Capture with Debezium: The next step is capturing changes (all of the inserts, updates, and deletes) in MongoDB using Debezium. Debezium provides a CDC connector for MongoDB that captures document-level changes in the database. Once captured, the changes are sent to Kafka for processing.
  3. Data Streaming with Kafka: Kafka receives the change events from Debezium and streams them to consumers. In our case, the consumer is RisingWave.
  4. Data Processing with RisingWave: Finally, the streamed data is received and processed by RisingWave, which provides a high-level SQL interface for complex event processing and streaming analytics. The processed data can be passed to BI and data analytics platforms or used for real-time decision-making, anomaly detection, predictive analytics, and much more.

This pipeline’s key strengths are its ability to handle large volumes of data, process events in real time, and produce insights with minimal latency. For example, this solution could power a global hotel search platform that needs real-time updates on hotel rates and availability: when a rate or availability changes in one of the platform’s primary databases, Debezium captures the change and streams it to Kafka, and RisingWave performs trend analysis on top of it. This ensures that users always see the most current information when they search for hotels.
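
To make the example concrete, here is a minimal sketch of what such a trend analysis could look like in RisingWave. The hotel_rates table and its columns are hypothetical; it is assumed to be fed from a Debezium CDC topic in the same way as the users table later in this guide:

-- Hypothetical stream of hotel rate updates captured via CDC.
-- RisingWave keeps this aggregate incrementally up to date as
-- new rate changes arrive from Kafka.
CREATE MATERIALIZED VIEW hotel_rate_trends AS
SELECT
    hotel_id,
    AVG(rate) AS avg_rate,
    MIN(rate) AS min_rate,
    MAX(rate) AS max_rate
FROM
    hotel_rates
GROUP BY
    hotel_id;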

Integration Quickstart

This guide shows you how to configure the MongoDB Debezium Connector to send data from MongoDB to Kafka topics, and how to ingest that data into RisingWave.

After completing this guide, you should understand how to use these tools to create a real-time data processing pipeline, and how to create a data source and materialized view in RisingWave to analyze data with SQL queries.

To complete the steps in this guide, clone the existing sample project from GitHub. The project uses Docker for convenience and consistency: it provides a containerized development environment that includes all the services you need to build the sample data pipeline.

Before You Begin

To run the project in your local environment, you need the following:

  • Git
  • Docker installed in your environment
  • The PostgreSQL interactive terminal, psql, installed in your environment (for detailed instructions, see Download PostgreSQL)

Start the Project

The docker-compose file starts the following services in Docker containers:

  • RisingWave database
  • MongoDB, configured as a replica set
  • Debezium
  • A Python app that generates random data for MongoDB
  • Redpanda, with the MongoDB Debezium Connector installed (we use Redpanda as the Kafka broker)
  • Kafka Connect UI, to manage and monitor our connectors

To start the project, simply run the following command from the tutorial directory:

docker compose up


When you start the project, Docker downloads any images it needs to run. You can see the full list of services in the docker-compose.yaml file.

Data Flow

App.py generates random user data (name, address, and email) and inserts it into the MongoDB users collection. Because the Debezium MongoDB connector is configured to point at the database and collection we want to monitor, it captures changes in real time and sinks them to Redpanda in a Kafka topic called dbserver1.random_data.users (Debezium names topics using the <prefix>.<database>.<collection> convention). In the next steps, we consume these Kafka events and create a materialized view using RisingWave.

Create a Data Source

To consume the Kafka topic with RisingWave, we first need to set up a data source; in the demo project, Kafka is the data source. Open a new terminal window and run the following to connect to RisingWave:

psql -h localhost -p 4566 -d dev -U root


Because RisingWave is a database, you can create a table directly on top of the Kafka topic:

CREATE TABLE users (_id JSONB PRIMARY KEY, payload JSONB) WITH (
    connector = 'kafka',
    kafka.topic = 'dbserver1.random_data.users',
    kafka.brokers = 'message_queue:29092',
    kafka.scan.startup.mode = 'earliest'
) ROW FORMAT DEBEZIUM_MONGO_JSON;
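
To sanity-check the ingestion before building any views, you can query the table directly. A minimal check (the exact payload contents depend on the random data App.py generates) might be:

SELECT
    _id,
    payload
FROM
    users
LIMIT
    1;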


Normalize Data With Materialized Views

To normalize the user data, we create a materialized view in RisingWave:

CREATE MATERIALIZED VIEW normalized_users AS
SELECT
    payload ->> 'name' as name,
    payload ->> 'email' as email,
    payload ->> 'address' as address
FROM
    users;

The main benefit of materialized views is that they save the computation needed to perform complex joins, aggregations, or calculations. Instead of running these operations each time data is queried, the results are calculated in advance and stored.
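
For example, a slightly more involved view (hypothetical, but built on the same users table) could maintain a continuously updated count of users per email domain. RisingWave updates the aggregate incrementally as each new Debezium event arrives instead of rescanning the data on every query:

-- Count users per email domain, updated incrementally
-- as new change events stream in from Kafka.
CREATE MATERIALIZED VIEW users_per_domain AS
SELECT
    split_part(payload ->> 'email', '@', 2) AS domain,
    COUNT(*) AS user_count
FROM
    users
GROUP BY
    split_part(payload ->> 'email', '@', 2);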

Query Data

Use the SELECT command to query data in the materialized view. Let's see the latest results of the normalized_users materialized view:

SELECT
    *
FROM
    normalized_users
LIMIT
    10;


In response to your query, a possible result set (with random data) might look like this:

[Image: sample result set from the normalized_users view]

Summary

This is a basic setup for using MongoDB, Kafka, Debezium, and RisingWave in a real-time data processing pipeline. The setup can be adjusted based on your specific needs, such as adding more Kafka topics, tracking changes in multiple MongoDB collections, implementing more complex data processing logic, or combining multiple streams in RisingWave.
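
As an illustration of that last point, combining streams in RisingWave is just a SQL join across sources. The orders table below is hypothetical and would be ingested from its own Kafka topic in the same way as users:

-- Hypothetical join of two CDC streams: enrich each order with
-- the matching user's name. RisingWave keeps the joined view
-- current as either stream changes.
CREATE MATERIALIZED VIEW orders_with_users AS
SELECT
    o.order_id,
    o.amount,
    u.payload ->> 'name' AS user_name
FROM
    orders o
    JOIN users u ON o.user_email = u.payload ->> 'email';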

Related Resources

  • Debezium connector for MongoDB
  • MongoDB Kafka Connector


Published at DZone with permission of Bobur Umurzokov. See the original article here.

Opinions expressed by DZone contributors are their own.
