Have you ever been convinced that a solution you built or rely on is the best there is, and that nothing will beat it in the years to come? That is not quite how things work in the ever-evolving IT industry, especially when it comes to big data processing. From the days of Apache Spark and the evolution of Cassandra 3 to 4, the landscape has witnessed rapid changes. However, a new player has entered the scene that promises to dominate the arena with its unprecedented performance and benchmark results. Enter ScyllaDB, a rising star that has redefined the standards of big data processing.

The Evolution of Big Data Processing

To appreciate the significance of ScyllaDB, it’s essential to delve into the origins of big data processing. The journey began with the need to handle vast amounts of data efficiently. Over time, various solutions emerged, each addressing specific challenges. From the pioneering days of Hadoop to the distributed architecture of Apache Cassandra, the industry witnessed a remarkable evolution. Yet each solution presented its own set of trade-offs, highlighting the continuous quest for the right balance between performance, consistency, and scalability. You can check the official ScyllaDB website for benchmarks and comparisons with Cassandra and DynamoDB.

Understanding Big Data Processing and NoSQL Consistency

Big data processing brought about a paradigm shift in data management, giving rise to NoSQL databases. One of the pivotal concepts in this realm is eventual consistency, a principle that allows distributed systems to achieve availability and partition tolerance while sacrificing strict consistency. This is closely tied to the CAP theorem, which asserts that a distributed system can achieve only two out of three properties: consistency, availability, and partition tolerance. As organizations grapple with the implications of this theorem, ScyllaDB has emerged as a formidable contender that aims to strike an optimal balance between these factors. You can learn more about the CAP theorem in this video.

Fine-Tuning Performance and Consistency With ScyllaDB

ScyllaDB enters the arena with a promise to shatter the conventional limits of big data processing. It does so by focusing on two critical aspects: performance and consistency. Leveraging its unique architecture, ScyllaDB optimizes data distribution and replication to ensure minimal latency and high throughput. Moreover, it provides tunable consistency levels, enabling organizations to tailor database behavior to their specific needs. This flexibility empowers users to strike a harmonious equilibrium between data consistency and system availability, a feat that was often considered challenging in the world of big data.

The Rust Advantage

ScyllaDB provides support for various programming languages, and each has its strengths. One language that stands out, however, is Rust. Rust focuses on memory safety, zero-cost abstractions, and concurrency, and its robustness significantly reduces common programming errors, bolstering the overall stability of your application. When evaluating the choice of programming language for your project, it’s essential to consider the unique advantages that Rust brings to the table, alongside other supported languages like Java, Scala, Node.js, and more. Each language offers its own merits, allowing you to tailor your solution to your specific development needs.
One Last Word About the “ScyllaDB and Rust” Combination

Scylla with Rust brings together the performance benefits of the Scylla NoSQL database with the power and capabilities of the Rust programming language. Just as Apache Spark and Scala or Cassandra and Java offer synergistic combinations, Scylla’s integration with Rust pairs a high-performance database with a programming language known for its memory safety, concurrency, and low-level system control. Rust’s safety guarantees make it a strong choice for building system-level software with fewer risks of memory-related errors. Combining Rust with Scylla allows developers to create efficient, safe, and reliable applications that can harness Scylla’s performance advantages for handling large-scale, high-throughput workloads. This pairing aligns well with the philosophy of leveraging specialized tools to optimize specific aspects of application development, akin to how Scala complements Apache Spark or Java complements Cassandra. Ultimately, the Scylla and Rust combination empowers developers to build resilient and high-performance applications for modern data-intensive environments.

Demo Time

Imagine handling a large stream of information smoothly. I’ve set up a small pipeline with three main parts. First, a Producer keeps creating new users. Then, a Processor watches this data and can transform it if needed. Lastly, an Analyzer collects useful insights from the data. This setup is similar to how popular pairs like Apache Spark and Scala or Cassandra and Java work together: we’re exploring how Scylla, the database, and Rust, the programming language, can team up to make this process efficient and safe.

Set Up ScyllaDB

```yaml
services:
  scylla:
    image: scylladb/scylla
    ports:
      - "9042:9042"
    environment:
      - SCYLLA_CLUSTER_NAME=scylladb-bigdata-demo
      - SCYLLA_DC=dc1
      - SCYLLA_LISTEN_ADDRESS=0.0.0.0
      - SCYLLA_RPC_ADDRESS=0.0.0.0
```

Create a Data Generator, Processor, and Analyzer

```shell
mkdir producer processor analyzer
cargo new producer
cargo new processor
cargo new analyzer
```

Producer

```rust
// Assumed imports for the scylla, tokio, uuid, and rand crates used below.
use std::error::Error;
use std::time::Duration;

use rand::Rng;
use scylla::transport::Compression;
use scylla::SessionBuilder;
use tokio::time::sleep;
use uuid::Uuid;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let uri = std::env::var("SCYLLA_CONTACT_POINTS")
        .unwrap_or_else(|_| "127.0.0.1:9042".to_string());

    let session = SessionBuilder::new()
        .known_node(uri)
        .compression(Some(Compression::Snappy))
        .build()
        .await?;

    // Create the keyspace if it doesn't exist
    session
        .query(
            "CREATE KEYSPACE IF NOT EXISTS ks WITH REPLICATION = \
             {'class' : 'SimpleStrategy', 'replication_factor' : 1}",
            &[],
        )
        .await?;

    // Use the keyspace
    session.query("USE ks", &[]).await?;

    // Create the table if it doesn't exist
    session
        .query(
            "CREATE TABLE IF NOT EXISTS ks.big_data_demo_table \
             (id UUID PRIMARY KEY, name TEXT, created_at TIMESTAMP)",
            &[],
        )
        .await?;

    loop {
        let id = Uuid::new_v4();
        let name = format!("User{}", id);
        let name_clone = name.clone();

        session
            .query(
                "INSERT INTO ks.big_data_demo_table (id, name, created_at) \
                 VALUES (?, ?, toTimestamp(now()))",
                (id, name_clone),
            )
            .await?;

        println!("Inserted: ID {}, Name {}", id, name);

        // Simulate data generation time
        let delay = rand::thread_rng().gen_range(1000..5000);
        sleep(Duration::from_millis(delay)).await;
    }
}
```

Processor

```rust
use std::error::Error;
use std::time::{Duration, SystemTime};

use scylla::transport::Compression;
use scylla::SessionBuilder;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let uri = std::env::var("SCYLLA_CONTACT_POINTS")
        .unwrap_or_else(|_| "127.0.0.1:9042".to_string());

    let session = SessionBuilder::new()
        .known_node(uri)
        .compression(Some(Compression::Snappy))
        .build()
        .await?;

    let mut last_processed_time = SystemTime::now();

    loop {
        // Calculate the last processed timestamp
        // (CQL TIMESTAMP values are epoch milliseconds)
        let last_processed_ts = last_processed_time
            .duration_since(SystemTime::UNIX_EPOCH)
            .expect("Time went backwards")
            .as_millis() as i64;

        let query = format!(
            "SELECT id, name FROM ks.big_data_demo_table \
             WHERE created_at > {} ALLOW FILTERING",
            last_processed_ts
        );

        // Query data
        if let Some(rows) = session.query(query, &[]).await?.rows {
            for row in rows {
                println!("ID:");
                if let Some(id_column) = row.columns.get(0) {
                    if let Some(id) = id_column.as_ref().and_then(|col| col.as_uuid()) {
                        println!("{}", id);
                    } else {
                        println!("(NULL)");
                    }
                } else {
                    println!("Column not found");
                }

                println!("Name:");
                if let Some(name_column) = row.columns.get(1) {
                    if let Some(name) = name_column.as_ref().and_then(|col| col.as_text()) {
                        println!("{}", name);
                    } else {
                        println!("(NULL)");
                    }
                } else {
                    println!("Column not found");
                }

                // Update the last processed timestamp
                last_processed_time = SystemTime::now();

                // Perform your data processing logic here
            }
        };

        // Add a delay between iterations; adjust as needed
        tokio::time::sleep(Duration::from_secs(10)).await;
    }
}
```

Analyzer

```rust
use std::error::Error;
use std::time::{Duration, SystemTime};

use scylla::transport::Compression;
use scylla::SessionBuilder;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let uri = std::env::var("SCYLLA_CONTACT_POINTS")
        .unwrap_or_else(|_| "127.0.0.1:9042".to_string());

    let session = SessionBuilder::new()
        .known_node(uri)
        .compression(Some(Compression::Snappy))
        .build()
        .await?;

    let mut total_users = 0;
    let mut last_processed_time = SystemTime::now();

    loop {
        // Calculate the last processed timestamp
        // (CQL TIMESTAMP values are epoch milliseconds)
        let last_processed_ts = last_processed_time
            .duration_since(SystemTime::UNIX_EPOCH)
            .expect("Time went backwards")
            .as_millis() as i64;

        let query = format!(
            "SELECT id, name, created_at FROM ks.big_data_demo_table \
             WHERE created_at > {} ALLOW FILTERING",
            last_processed_ts
        );

        // Query data
        if let Some(rows) = session.query(query, &[]).await?.rows {
            for row in rows {
                println!("ID:");
                if let Some(id_column) = row.columns.get(0) {
                    if let Some(id) = id_column.as_ref().and_then(|col| col.as_uuid()) {
                        total_users += 1;
                        println!(
                            "Active Users {}, after adding recent user {}",
                            total_users, id
                        );
                    } else {
                        println!("(NULL)");
                    }
                } else {
                    println!("Column not found");
                }

                println!("Name:");
                if let Some(name_column) = row.columns.get(1) {
                    if let Some(name) = name_column.as_ref().and_then(|col| col.as_text()) {
                        println!("{}", name);
                    } else {
                        println!("(NULL)");
                    }
                } else {
                    println!("Column not found");
                }

                // Update the last processed timestamp
                last_processed_time = SystemTime::now();

                // Perform your data analysis logic here
            }
        };

        // Add a delay between iterations; adjust as needed
        tokio::time::sleep(Duration::from_secs(10)).await;
    }
}
```

Now, Let’s Run the Docker Compose

```shell
docker compose up -d
```

Validate the container state directly in VS Code with the Docker plugin, then attach the logs in VS Code. You should see output from the Producer, the Processor, and the Analyzer.

Summary: A New Chapter in Big Data Processing

In the relentless pursuit of an ideal solution for big data processing, ScyllaDB emerges as a trailblazer that combines the lessons learned from past solutions with a forward-thinking approach. By reimagining the possibilities of performance, consistency, and language choice, ScyllaDB showcases how innovation can lead to a new era in the realm of big data.
As technology continues to advance, ScyllaDB stands as a testament to the industry’s unwavering commitment to elevating the standards of data processing and setting the stage for a future where excellence is constantly redefined. That’s all for now. Happy learning!
Over 100,000 organizations use Apache Kafka for data streaming. However, there is a problem: The broad ecosystem lacks a mature client framework and managed cloud service for Python data engineers. Quix Streams is a new technology on the market trying to close this gap. This blog post discusses this Python library, its place in the Kafka ecosystem, and when to use it instead of Apache Flink or other Python- or SQL-based substitutes.

Why Python and Apache Kafka Together?

Python is a high-level, general-purpose programming language. It has many use cases for scripting and development. But there is one fundamental reason for its success: Data engineers and data scientists use Python. Period. Yes, there is R as another excellent programming language for statistical computing. And many low-code/no-code visual coding platforms for machine learning (ML). SQL usage is ubiquitous amongst data engineers and data scientists, but it’s a declarative formalism that isn’t expressive enough to specify all necessary business logic. When data transformation or non-trivial processing is required, data engineers and data scientists use Python. Hence, the conclusion stays the same: data engineers and data scientists use Python. If you don’t give them Python, you will find either shadow IT or Python scripts embedded into the coding box of a low-code tool. Apache Kafka is the de facto standard for data streaming. It combines real-time messaging, storage for true decoupling and replayability of historical data, data integration with connectors, and stream processing for data correlation. All in a single platform. At scale for transactions and analytics.

Python and Apache Kafka for Data Engineering and Machine Learning

In 2017, I wrote a blog post about “How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka.” The article is still accurate and explores how data streaming and AI/ML are complementary: Machine learning requires a lot of infrastructure for data collection, data engineering, model training, model scoring, monitoring, and so on. Data streaming with the Kafka ecosystem enables these capabilities in real time, reliably, and at scale. DevOps, microservices, and other modern deployment concepts merged the job roles of software developers and data engineers/data scientists. The focus is much more on data products solving a business problem, operated by the team that develops it. Therefore, the Python code needs to be production-ready and scalable. As mentioned above, the data engineering and ML tasks are usually realized with Python APIs and frameworks. Here is the problem: The Kafka ecosystem is built around Java and the JVM. Therefore, it lacks good Python support. Let’s explore the options and why Quix Streams is a brilliant opportunity for data engineering teams for machine learning and similar tasks.

What Options Exist for Python and Kafka?

Many alternatives exist for data engineers and data scientists to leverage Python with Kafka.

Python Integration for Kafka

Here are a few common alternatives for integrating Python with Kafka and their trade-offs: Python Kafka client libraries: Produce and consume via Python (a minimal sketch of this option appears at the end of this article). This is solid but insufficient for advanced data engineering as it lacks processing primitives, such as filtering and joining operations found in Kafka Streams and other stream processing libraries. Kafka REST APIs: Confluent REST Proxy and similar components enable producing and consuming to/from Kafka.
It works well for gluing interfaces together but is not ideal for ML workloads with low latency and critical SLAs. SQL: Stream processing engines like ksqlDB or Flink SQL allow querying of data in SQL. However, ksqlDB and Flink are additional systems that need to be operated, and SQL isn’t expressive enough for all use cases. Instead of just integrating Python and Kafka via APIs, native stream processing provides the best of both worlds: the simplicity and flexibility of dynamic Python code for rapid prototyping with Jupyter notebooks and serious data engineering, AND stream processing for stateful data correlation at scale, whether for data ingestion or model scoring.

Stream Processing With Python and Kafka

In the past, we had two suboptimal open-source options for stream processing with Kafka and Python: Faust: A stream processing library, porting the ideas from Kafka Streams (a Java library and part of Apache Kafka) to Python. The feature set is much more limited compared to Kafka Streams. Robinhood open-sourced Faust, but it lacks maturity and community adoption. I saw several customers evaluating it but then moving to other options. Apache Flink’s Python API: Flink’s adoption grows significantly yearly in the stream processing community. This API is a Python version of the DataStream API, which allows Python users to write Python DataStream API jobs. Developers can also use the Table API, including SQL, directly there. It is an excellent option if you have a Flink cluster and some folks want to run Python instead of Java or SQL against it for data engineering. The Kafka-Flink integration is very mature and battle-tested. As you see, all the alternatives for combining Kafka and Python have trade-offs. They work for some use cases but are imperfect for most data engineering and data science projects. A new open-source framework to the rescue? Introducing a brand new stream processing library for Python: Quix Streams…

What Is Quix Streams?

Quix Streams is a stream processing library focused on ease of use for Python data engineers. The library is open source under the Apache 2.0 license and available on GitHub. Instead of a database, Quix Streams uses a data streaming platform such as Apache Kafka. You can process data with high performance and save resources without introducing a delay. Some of the Quix Streams differentiators are defined as being lightweight and powerful, with no JVM and no need for separate clusters or orchestrators. It sounds like the pitch for why to use Kafka Streams in the Java ecosystem minus the JVM — this is a positive comment! :-) Quix Streams does not use any domain-specific language or embedded framework. It’s a library that you can use in your code base. This means that with Quix Streams, you can use any external library for your chosen language. For example, data engineers can leverage Pandas, NumPy, PyTorch, TensorFlow, Transformers, and OpenCV in Python. So far, so good. This was more or less a copy and paste of Quix Streams marketing (it makes sense to me)… Now, let’s dig deeper into the technology.

The Quix Streams API and Developer Experience

The following is my first feedback after playing around, doing code analysis, and speaking with some Confluent colleagues and the Quix Streams team.

The Good

The Quix API and tooling persona is the data engineer (that’s at least my understanding). Hence, it does not directly compete with other offerings, say a Java developer using Kafka Streams.
Again, the beauty of microservices and data mesh is the focus on an application or data product per use case. Choose the right tool for the job! The API is mostly sleek, with a few unintuitive parts. But it is still in beta, so hopefully it will get more refined in subsequent releases. No worries at this early stage of the project. The integration with other data engineering and machine learning Python frameworks is excellent. Being able to combine stream processing with Pandas, NumPy, and similar libraries is a massive benefit for the developer experience. The Quix library and SaaS platform are compatible with open-source Kafka and commercial offerings and cloud services like Cloudera, Confluent Cloud, or Amazon MSK. Quix’s commercial UI provides out-of-the-box integration with self-managed Kafka and Confluent Cloud. The cloud platform also provides a managed Kafka for testing purposes (for a few cents per Kafka topic and not meant for production).

The Improvable

The stream processing capabilities (like powerful sliding windows) are still pretty limited and not comparable to advanced engines like Kafka Streams or Apache Flink. The roadmap includes enhanced features. The architecture is complex since executing the Python API jumps through three languages: Python -> C# -> C++. Does it matter to the end user? It depends on the use case, security requirements, and more. The reasoning for this architecture is Quix’s background coming from the McLaren F1 team, ultra-low latency use cases, and building a polyglot platform for different programming environments. It would be interesting to see a benchmark for throughput and latency versus Faust, which is Python top to bottom. There is a trade-off between inter-language marshalling/unmarshalling and the performance boost of lower-level compiled languages. This should be fine if we trust Quix’s marketing and business model. I expect they will provide some public content soon, as this question will arise regularly.

The Quix Streams Data Pipeline Low-Code GUI

The commercial product provides a user interface for building data pipelines and code, MLOps, and a production infrastructure for operating and monitoring the built applications. Here is an example: Tiles are Kubernetes (K8s) containers. Each purple (transformation) and orange (destination) node is backed by a Git project containing the application code. The three blue (source) nodes on the left are replay services used to test the pipeline by replaying specific streams of data. Arrows are individual Kafka topics in Confluent Cloud (green = live data). The first visible pipeline node (bottom left) joins data from different physical sites (see the three input topics; one was receiving data when I took the image). There are three modular transformations in the visible pipeline (two rolling windows and one interpolation). There are two real-time apps (one is a real-time Streamlit dashboard, and the other is an integration with a Twilio SMS service).

Quix Streams vs. Apache Flink for Stream Processing With Python

The Quix team wrote a detailed comparison of Apache Flink and Quix Streams. I don’t think it’s an entirely fair comparison, as it compares open-source Apache Flink to a Quix SaaS offering. Nevertheless, for the most part, it is a good comparison. Flink was always Java-first and added support for Python for its DataStream and Table APIs at a later stage. In contrast, Quix Streams is brand new. Hence, it lacks maturity and customer case studies.
Having said all this, I think Quix Streams is a great choice for some stream processing projects in the Python ecosystem!

Should You Use Quix Streams or Apache Flink?

TL;DR: There is a place for both! Choose the right tool… Modern enterprise architectures built with concepts like data mesh, microservices, and domain-driven design allow this flexibility per use case and problem. I recommend using Flink if the use case makes sense with SQL or Java, and if the team is willing to operate its own Flink cluster or has a platform team or a cloud service taking over the operational burden and complexity. On the contrary, I would use Quix Streams for Python projects if I want to go to production with a more microservice-like architecture building Python applications. However, beware that Quix currently only has a few built-in stateful functions or JOINs. More advanced stream processing use cases cannot be done with Quix (yet). This is likely to change in the coming months as more capabilities are added. Hence, make sure to read Quix’s comparison with Flink, but keep in mind whether you want to evaluate the open-source Quix Streams library or the Quix SaaS platform. If you are in the public cloud, you might combine Quix Streams SaaS with other fully managed cloud services like Confluent Cloud for Kafka. On the other hand, in your own private VPC or on-premises, you need to build your own platform with technologies like the Quix Streams library, Kafka or Confluent Platform, and so on.

The Current State and Future of Quix Streams

If you build a new framework or product for data streaming, you need to make sure that it does not overlap with existing established offerings. You need differentiators and/or innovation in a new domain that does not exist today. Quix Streams accomplishes this essential requirement to be successful: The target audience is data engineers with Python backgrounds. No serious, mature tool or vendor exists in this space today, and the demand for Python will grow more and more with the focus on leveraging data for solving business problems in every company.

Maturity: Making the Right (Marketing) Choices in the Early Stage

Quix Streams is in an early maturity stage. Hence, a lot of decisions can still be strengthened or revamped. The following buzzwords come to mind when I think about Quix Streams: Python, data streaming, stream processing, Python, data engineering, machine learning, open source, cloud, Python, .NET, C#, Apache Kafka, Apache Flink, Confluent, MSK, DevOps, Python, governance, UI, time series, IoT, Python, and a few more. TL;DR: I see a massive opportunity for Quix Streams to become a great data engineering framework (and SaaS offering) for Python users. I am not a fan of polyglot platforms. They require finding the lowest common denominator. I was never a fan of Apache Beam for that reason; the Kafka Streams community did not choose to implement the Beam API because of too many limitations. Similarly, most people do not care about the underlying technology. Yes, Quix Streams’ core is C++. But is the goal to roll out stream processing for various programming languages, only starting with Python, then going to .NET, and then to another one? I am skeptical. Hence, I would like to see a change in the marketing strategy already: Quix Streams started with the pitch of being designed for high-frequency telemetry services when you must process high volumes of time-series data with up to nanosecond precision.
It is now being revamped to focus mainly on Python and data engineering.

Competition: Friends or Enemies?

Getting market adoption is still hard. Intuitive use of the product, building a broad community, and the right integrations and partnerships (can) make a new product such as Quix Streams successful. Quix Streams is on a good path here. For instance, integrating serverless Confluent Cloud and other Kafka deployments works well: This is a native integration, not a connector. Everything in the pipeline image runs as a direct Kafka protocol connection using raw TCP/IP packets to produce and consume data to topics in Confluent Cloud. The Quix platform orchestrates the management of the Confluent Cloud Kafka cluster (create/delete topics, topic sizing, topic monitoring, etc.) using Confluent APIs. However, one challenge for these kinds of startups is the decision to complement versus compete with existing solutions, cloud services, and vendors. For instance, how much time and money do you invest in data governance? Should you build this yourself, use the complementing streaming platform, or use a separate independent tool (like Collibra)? We will see where Quix Streams goes here: building its cloud platform for addressing Python engineers, or overlapping with other streaming platforms. My advice is to integrate properly with partners that lead in their space. Working with Confluent for over six years, I know what I am talking about: We do one thing, data streaming, but we are the best at that one thing. We don’t even try to compete with other categories. Yes, a few overlaps always exist, but instead of competing, we strategically partner and integrate with other vendors like Snowflake (data warehouse), MongoDB (transactional database), HiveMQ (IoT with MQTT), Collibra (enterprise-wide data governance), and many more. Additionally, we extend our offering with more data streaming capabilities, i.e., improving our core functionality and business model. The latest example is our integration of Apache Flink into the fully managed cloud offering.

Kafka for Python? Look At Quix Streams!

In the end, a data engineer or developer has several options for stream processing deeply integrated into the Kafka ecosystem: Kafka Streams (Java client library), ksqlDB (SQL service), Apache Flink (Java, SQL, and Python service), Faust (Python client library), and Quix Streams (Python client library). All have their pros and cons. The persona of the data engineer or developer is a crucial criterion. Quix Streams is a nice new open-source framework for the broader data streaming community. If you cannot or do not want to use just SQL but native Python, then watch the project (and the company/cloud service behind it). Bytewax is another open-source stream processing library for Python integrating with Kafka. It is implemented in Rust under the hood. I have not seen it in the field yet, but a few comments mentioned it after I shared this blog post on social networks, and I think it is worth a mention. Let’s see if it gets more traction in the coming months.
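For readers who want to see the baseline that all of these frameworks build on, here is a minimal, hedged sketch of the plain “Python Kafka client library” option mentioned at the start of this article, using the confluent-kafka package. The broker address, topic, and consumer group are illustrative placeholders, not anything from this post. It shows what you get out of the box, produce and consume, with filtering, joins, windows, and state left entirely to you.

```python
# Minimal sketch of the plain Python client-library option (confluent-kafka).
# Broker address, topic, and group id are illustrative placeholders.
from confluent_kafka import Consumer, Producer

conf = {"bootstrap.servers": "localhost:9092"}

# Produce a single record
producer = Producer(conf)
producer.produce("demo-topic", key="user-1", value='{"clicks": 42}')
producer.flush()

# Consume records in a simple poll loop
consumer = Consumer(
    {**conf, "group.id": "demo-group", "auto.offset.reset": "earliest"}
)
consumer.subscribe(["demo-topic"])
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        # Filtering, joins, windows, etc. would have to be hand-rolled here --
        # exactly the gap that Faust, Flink, and Quix Streams try to fill.
        print(msg.key(), msg.value())
finally:
    consumer.close()
```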
The recent progress in AI is very exciting. People are using it in all sorts of novel ways, from improving customer support experiences and writing and running code to making new music and even accelerating medical imaging technology. But in the process, a worrying trend has emerged: the AI community seems to be reinventing data movement (aka ELT). Whether they call them connectors, extractors, integrations, document loaders, or something else, people are writing the same code to extract data out of the same APIs, document formats, and databases and then load them into vector DBs or indices for their LLMs. The problem is that building and maintaining robust extraction and loading pipelines from scratch is a huge commitment. And there’s so much prior art in that area that for almost all engineers or companies in the AI space, it’s a huge waste of time to rebuild it. In a space where breaking news emerges approximately every hour, the main focus should be on making your core product incredible for your users, not going on side quests. And for almost everyone, the core product is not data movement; it’s the AI-powered magic sauce you’re brewing. A lot has been written about the challenges involved in building robust Extract, Transform, and Load (ETL) pipelines, but let’s contextualize it within AI. Why Does AI Need Data Movement? LLMs trained on public data are great, but you know what’s even better? AIs that can answer questions specific to us, our companies, and our users. We’d all love it if ChatGPT could learn our entire company wiki, read all of our emails, Slack messages, meeting notes, and transcripts, plug into our company’s analytics environment, and use all of these sources when answering our questions. Or, when integrating AI into our own product (for example, with Notion AI), we'd want our app’s AI model to know all the information we have about a user when helping them. Data movement is a prerequisite for all that. Whether you’re fine-tuning a model or using Retrieval-Augmented Generation (RAG), you need to extract data from wherever it lives, transform it into a format digestible by your model, and then load it into a datastore your AI app can access to serve your use case. The diagram above illustrates what this looks like when using RAG, but you can imagine that even if you’re not using RAG, the basic steps are unlikely to change: you need to Extract, Transform, and Load (aka ETL) the data in order to build AI models which know non-public information specific to you and your use case. Why Is Data Movement Hard? Building a basic functional MVP for data extraction from an API or database is usually – though not always – doable on quick (<1 week) notice. The really hard part is making it production-ready and keeping it that way. Let’s look at some of the standard challenges that come to mind when building extraction pipelines. Incremental Extracts and State Management: If you have any meaningful data volume, you’ll need to implement incremental extraction such that your pipeline only extracts the data it hasn’t seen before. To do this, you’ll need to have a persistence layer to keep track of what data each connection extracted. Transient Error Handling, Backoffs, Resume-On-Failure(s), Air Gapping: Upstream data sources fail all the time, sometimes without any clear reason. Your pipelines need to be resilient to this and retry with the right backoff policies (a minimal sketch of these first two concerns appears right after this list of challenges).
If the failures are not-so-transient (but still not your fault), then your pipeline needs to be resilient enough to remember where it left off and resume from the same place once upstream is fixed. Sometimes, the problem coming from upstream is severe enough (like an API dropping some crucial fields from records) that you want to pause the whole pipeline altogether until you examine what’s happening and manually decide what to do. Identifying and Proactively Fixing Configuration Errors Suppose you’re building data extraction pipelines to extract your customers’ data. In that case, you’ll need to implement some defensive checks to ensure that all the configuration your customers gave you to extract data on their behalf is correct, and if they’re not, quickly give them actionable error messages. Most APIs do not make this easy because they don’t publish comprehensive error tables, and even when they do, they rarely give you endpoints that you can use to check the permissions assigned to, e.g., API tokens, so you have to find ways to balance comprehensive checks with quick feedback for the user. Authentication and Secret Management APIs range in simplicity from simple bearer token auth to “creative” implementations of session tokens or single-use-token OAuth. You’ll need to implement the logic to perform the auth as well as manage the secrets, which may be getting refreshed once an hour, potentially coordinating secret refreshes across multiple concurrent workers. Optimizing Extract and Load Speeds, Concurrency, and Rate Limits And speaking of concurrent workers, you’ll likely want to implement concurrency to achieve high throughput for your extractions. While this may not matter on small datasets, it’s absolutely crucial on larger ones. Even though APIs publish official rate limits, you’ll need to empirically figure out the best parallelism parameters for maxing out the rate limit provided to you by the API without getting IP blacklisted or forever-rate-limited. Adapting to Upstream API Changes APIs change and take on new undocumented behaviors or quirks all the time. Many vendors publish new API versions quarterly. You’ll need to keep an eye on how all these updates may impact your work and devote engineering time to keep it all up to date. New endpoints come up all the time, and some change their behavior (and you don’t always get a heads-up). Scheduling, Monitoring, Logging, and Observability Beyond the code that extracts data from specific APIs, you’ll also likely need to build some horizontal capabilities leveraged by all of your data extractors. You’ll want some scheduling as well as logging and monitoring for when the scheduling doesn’t work or when other things go wrong, and you need to go investigate. You also likely want some observability, such as how many records were extracted yesterday, today, last week, etc., and which API endpoints or database tables they come from. Data Blocking or Hashing Depending on where you’re pulling data from, you may need some privacy features for either blocking or hashing columns before sending them downstream. To be clear, the above does not apply if you just want to move a few files as a one-time thing. But it does apply when you’re building products that require data movement. Sooner or later, you’ll need to deal with most of these concerns. And while no single one of them is impossible rocket science, taken together, they can quickly add up to one or multiple full-time jobs, more so the more data sources you’re pulling from. 
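As referenced above, here is a minimal, hedged sketch of the first two challenges: cursor-based incremental extraction plus retry with exponential backoff. The REST endpoint, response fields, and state file are hypothetical placeholders rather than a real API, and a production connector would still need the auth, rate limiting, scheduling, and observability concerns described above.

```python
# Hedged sketch: cursor-based incremental extraction with retry/backoff.
# The API endpoint, response fields, and state file below are hypothetical.
import json
import time
from pathlib import Path

import requests

STATE_FILE = Path("extract_state.json")
API_URL = "https://api.example.com/v1/records"  # hypothetical endpoint

def load_cursor() -> str | None:
    # Persistence layer: remember what we already extracted
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text()).get("cursor")
    return None

def save_cursor(cursor: str | None) -> None:
    STATE_FILE.write_text(json.dumps({"cursor": cursor}))

def fetch_page(cursor: str | None, max_retries: int = 5) -> dict:
    # Transient-error handling with exponential backoff
    for attempt in range(max_retries):
        try:
            resp = requests.get(API_URL, params={"since": cursor}, timeout=30)
            if resp.status_code in (429, 500, 502, 503):
                raise RuntimeError(f"retryable status {resp.status_code}")
            resp.raise_for_status()
            return resp.json()
        except (requests.RequestException, RuntimeError):
            if attempt == max_retries - 1:
                raise  # give up; resume later from the saved cursor
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")

def run_incremental_extract() -> None:
    cursor = load_cursor()
    while True:
        page = fetch_page(cursor)
        for record in page.get("records", []):
            print(record)  # load into your destination here
        cursor = page.get("next_cursor")
        save_cursor(cursor)
        if not page.get("has_more"):
            break

if __name__ == "__main__":
    run_incremental_extract()
```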
And that’s exactly the difficulty with maintaining data extraction and pipelines: the majority of its cost comes from the continuous incremental investment needed to keep those pipelines functional and robust. For most AI engineers, that’s just not the job that adds the most value to their users. Their time is best spent elsewhere. So, What Does an AI Engineer Have To Do To Move Some Data Around Here? If you ever find yourself in need of data extraction and loading pipelines, try the solutions already available instead of automatically building your own. Chances are they can solve a lot, if not all, of your concerns. If not, build your own as a last resort. And even if existing platforms don’t support everything you need, you should still be able to get most of the way there with a portable and extensible framework. This way, instead of building everything from scratch, you can get 90% of the way there with off-the-shelf features in the platform and only build and maintain the last 10%. The most common example is long-tail integrations: if the platform doesn’t ship with an integration to an API you need, then a good platform will make it easy to write some code or even a no-code solution to build that integration and still get all the useful features offered by the platform. Even if you want the flexibility of just importing a connector as a Python package and triggering it however you like from your code, you can use one of the many open-source EL tools like Airbyte or Singer connectors. To be clear, data movement is not completely solved. There are situations where existing solutions genuinely fall short, and you need to build novel solutions. However, this is not the majority of the AI engineering population. Most people don’t need to rebuild the same integrations with Jira, Confluence, Slack, Notion, Gmail, Salesforce, etc., over and over again. Let’s just use the solutions that have already been battle-tested and made open for anyone to use so we can get on with adding the value our users actually care about.
When faced with the need to pass information asynchronously between components, there are two main technology choices: messaging and event streaming. While they share many common features, there are fundamental differences that make it crucial to choose correctly between the two. Messaging grew from the need for assured, decoupled delivery of data items across heterogeneous platforms and challenging networks. In contrast, event streaming provides a historical record of events for subscribers to peruse. The most common event streaming platform is the open-source Apache Kafka. A core difference between messaging and event streaming is that messages are destroyed when read, whereas event streaming retains an event history. But while this may help us understand the technical difference, we’ll need to dig a bit deeper to understand the usage difference. Publish/Subscribe Implementations Are Subtly Different, and the Differences Matter Messaging excels with dynamically changing and scalable consumer groups (a set of consumers that work together to read a topic/queue) and enables fine-grained, hierarchical, and dynamic topics. Publish/subscribe over topics is just one of the multiple interaction patterns that can be achieved with messaging. Kafka excels when there are large numbers of subscribers to all events on the topic since it works from one unchanging copy of the data (the event history) and partitions topic data across disks to enable parallelism. However, you need to be at very high volumes of events and huge numbers of consumers (100+) for this actually to matter. Note that Kafka always uses a publish/subscribe interaction pattern. Don’t Use Publish/Subscribe To Perform Request/Response Patterns Messaging enables multiple interaction patterns beyond publish/subscribe and is often used for request/response. Since Kafka only performs the publish/subscribe interaction pattern, it should not be used for request/response. While it may be technically possible to use publish/subscribe for request/response, it tends to lead to unnecessary complexity and/or poor performance at scale. Message Queuing Offers Granular Deployment, Whereas the Event Stream History Requires a Significant Minimum of Infrastructure Messaging can be deployed on independent compute resources, as small as a fraction of a CPU. As such, it can be deployed alongside an application and in low infrastructure landscapes, such as retail stores, or even directly on devices. Due to its stream history, Kafka is a multi-component, storage-focused infrastructure. As a result, Kafka is typically deployed as a shared capacity with a significant minimum footprint and more complex operational needs. The Overhead of Kafka’s Stream History Is Worth Accommodating if You Have the Right Use Case Kafka’s stream history enables three key interaction patterns that do not come out of the box with messaging. The first is ‘replay,’ which is able to replay events for testing, re-inflation of a cache, or data projection. This underpins application patterns such as event sourcing and CQRS. The second is ‘stream processing,’ which is the ability to continuously analyze all or part of the stream at once, discovering patterns within it. The third is the ‘persisted audit log.’ Since the event history is immutable, the event data cannot be changed and can only be deleted via archiving. 
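As a concrete illustration of the replay pattern described above, and assuming Kafka with the confluent-kafka Python client (broker address, topic, and group id are placeholders), a consumer can be pointed back at the beginning of the retained event history, for example to re-inflate a cache or rebuild a projection:

```python
# Hedged sketch: replaying a topic's retained event history with Kafka.
# Broker address, topic, and group id are illustrative placeholders.
from confluent_kafka import OFFSET_BEGINNING, Consumer, TopicPartition

consumer = Consumer(
    {
        "bootstrap.servers": "localhost:9092",
        "group.id": "cache-rebuild-1",   # fresh group: no committed offsets
        "enable.auto.commit": False,
    }
)

# Start from the beginning of partition 0 to re-read the full history,
# e.g., to re-inflate a cache or rebuild a projection (event sourcing/CQRS).
consumer.assign([TopicPartition("orders", 0, OFFSET_BEGINNING)])

try:
    while True:  # runs until interrupted
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        print(msg.offset(), msg.key(), msg.value())  # replace with rebuild logic
finally:
    consumer.close()
```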
Individual Message Delivery and Mass Event Transmission Are Opposing Use Cases Messaging uses queues to store pieces of data (i.e., messages) that are to be discretely consumed by the target system. It’s a good fit when it’s important to know whether you have consumed each individual message. It excels at “exactly once” delivery, which makes it very strong for transactional integrity. On the other hand, Kafka provides events as a log shared by all consumers who consume such events. It provides “at least once” delivery and is at its most efficient when consumers poll for arrays of records to process together. This makes Kafka a good choice for enabling a large number of consumers to consume events at high rates. Consider Where and How Data Integrity Is Handled Messaging enables transactional delivery of each message such that they appear to the consumer as a single ACID (atomicity, consistency, isolation, durability) interaction. Kafka pushes integrity concerns to the consumer (and provider to some extent), requiring a more complex interaction to ensure integrity. This is less of a concern where target systems are inherently idempotent (i.e., have the ability to ignore duplicate requests), and there is a community of pre-written source and sink connectors that encapsulate some of that complexity. Wrapping It Up In reality, we often see these technologies used in combination, each fulfilling the patterns they are designed for. It may help to distinguish between messaging and event streaming by considering that, whereas queue-based messaging provides an asynchronous transport, Kafka is more of an alternative type of data store. There are clearly limitations in that analogy, but it does help to separate the primary purpose of each technology. Messaging ensures that messages are eventually delivered to their target applications (regardless of the reliability of the intervening network), whereas event streaming provides a historical record of the events that have occurred. Despite their superficial similarities, the unique characteristics and requirements of each technology make them each important to an enterprise integration strategy.
Data engineers discovered the benefits of conscious uncoupling around the same time as Gwyneth Paltrow and Chris Martin in 2014. Of course, instead of life partners, engineers were starting to gleefully decouple storage and compute with emerging technologies like Snowflake (2012), Databricks (2013), and BigQuery (2010). This had amazing advantages from both a cost and scale perspective compared to on-premises databases. A data engineering manager at a Fortune 500 company expressed the pain of on-prem limitations to me by saying: “Our analysts were unable to run the queries they wanted to run when they wanted to run them. Our data warehouse would be down for 12 hours a day because we would regularly need to take it offline for data transformations and loading…The only word I can use to describe this process is painful.” A decade later, a considerable amount of innovation in the data management industry revolves around how different data platforms are coupling or decoupling storage and compute (don’t worry, I include examples in the next section). Closely related to this is how those same platforms are bundling or unbundling related data services, from data ingestion and transformation to data governance and monitoring. Why are these things related, and more importantly, why should data leaders care? Well, the connective tissue that powers and integrates these services is frequently found in the metadata of table formats (storage) and query/job logs (compute). How these aspects are managed within your data platform will play an outsized role in its performance, cost, ease of use, partner ecosystem, and future viability. Asking what type of data platform and what level of decoupling is correct is like asking what is the right way to format your SQL code: it is largely going to depend on personal preferences and professional requirements; however, there is a small range of possibilities that will satisfy most. I believe — at this current moment — that range for data platforms follows Aristotle’s golden mean. The majority will be best served by options in the middle of the spectrum, while operating at either extreme will be for the select few with very specialized use cases. Let’s dive into the current landscape and recent evolutions before looking closer at why that may be the case.

The Storage and Compute Data Platform Spectrum

A loud few have made headlines with a “cloud is expensive, let’s go back to our server racks” movement. While that may be a legitimate strategy for some, it is a rapidly dwindling minority. Just last week, the Pragmatic Engineer shone a spotlight on Twitter’s rate throttling and significant user experience issues that likely resulted from their moving machine learning models off of GCP and relying exclusively on their three data centers. The ability to scale and consume storage and compute independently is much more cost-effective and performant, but having these separate functions within the same data platform has advantages as well. The average unoptimized SQL query executed as part of an ad hoc analytics request will typically run a lot faster in these platforms that have been tuned to work well out of the box. A more decoupled architecture that separates compute and storage at the platform level can be very cost-effective in running heavy workloads, but that assumes a highly trained staff spends the time optimizing those workloads.
Data platforms with combined but decoupled storage and compute also provide a more robust, integrated user experience across several key DataOps tasks. Take data governance, for example. These platforms provide a centralized mechanism for access control, whereas decoupled architectures require federating roles across multiple query engines — a non-trivial task. The decoupled but combined approach is the magic behind one of the most common reviews of Snowflake: “Everything just works.” It’s no wonder Snowflake recently doubled down with Unistore for transactional workloads and opened up Snowpark to support Python and more data science (compute) workloads. Databricks saw amazing growth with its focus on its Spark processing framework, but it’s also not a coincidence that it unlocked a new level of growth after adding metadata and ACID-like transactions within Delta tables and governance features within Unity Catalog. They also recently doubled down and made it so that when you write to a Delta table (storage), the metadata within that table is written in formats readable by Delta Lake, Apache Iceberg, and Apache Hudi.

Challenger Platforms

This is why it’s interesting to see that many of the latest emerging data engineering technologies are starting to separate storage and compute at the vendor level. For example, Tabular describes itself as a “headless data warehouse” or “everything you need in a data warehouse except compute.” Going a step further, some organizations are migrating to Apache Iceberg tables within a data lake with “do it yourself” management of backend infrastructure and using a separate query engine like Trino. This is most commonly due to customer-facing use cases that require highly performant and cost-effective, interactive queries. DuckDB combines storage and compute but sacrifices the near-infinite compute of the modern data stack in favor of developer simplicity and reduced cost. So the question remains: are these innovations likely to replace incumbent cloud-native data platforms? Again, that answer will depend on who you are. DuckDB is an incredibly popular tool that many data analysts love, but it’s probably not going to be the rock upon which you build your data platform. Ultimately, we are seeing, and I believe will continue to see, a distribution that looks like this: I’ll explain why by looking across the modern data stack and data platform types across a couple of key dimensions.

Degree and Purpose of Consolidation

B2B vendors reverently reference the “single pane of glass” all the time. Is there value in having multiple services under a single umbrella? It depends on the quality of each service and how it corresponds to your needs. The value in the single pane of glass really comes from unifying what would otherwise be siloed information into a single story, or separate actions into what should be a single workflow. Let’s use Microsoft 365 as an example of this concept. It’s valuable having video and email integrated within their Teams collaboration application as those are core aspects of the meeting scheduling and video conferencing process. Is it as valuable to have Sway within their suite of apps? Again, it goes back to your requirements for interactive reporting. Coming back to the data universe, compute and storage are vital to that single story (the who, what, when, where, why, and how of DataOps) and to key workflows for aspects like cost, quality, and access management.
For this reason, these platforms will have the most robust partner ecosystems and more seamless integrations. This will likely be a key criterion for you unless you are the type of person who was using Windows and Fire phones rather than iPhones and Androids. Adam Woods, CEO of Choozle, spoke to our team last year about the importance he places on having a robust and tightly integrated partner ecosystem surrounding his data platform. “I love that…my data stack is always up-to-date and I never have to apply a patch. We are able to reinvest the time developers and database analysts would have spent worrying about updates and infrastructure into building exceptional customer experiences,” he said. Of course, there are always exceptions to the rule. A true data lake, headless warehouse, or other more complex platform may be ideal if you have edge cases at a large scale. Should your semantic layer, data quality, access control, catalog, BI, transformation, ingestion, and other tools all be bundled within the same platform? I think there are valid perspectives across this spectrum, but like every other department, most data teams will have a collection of tools that best fit their requirements.

Takeaway: Most data leaders will want to prioritize using a data platform that has both compute and storage services that can facilitate a “single story” and support a diverse partner ecosystem.

Performance vs. Ease of Use

Generally speaking, the more customizable a platform, the better it can perform across a wide array of use cases…and the harder it is to use. This is a pretty inescapable tradeoff and one you are making when you separate the storage and compute services across vendors. When thinking about data platform “ease of use,” it’s helpful to consider not only the day-to-day usage of the platform but the simplicity of administration and customization as well. In my experience, many teams overfit for platform performance. Our technical backgrounds immediately start comparing platforms like they are cars: “What’s the horsepower for this workload? What about that workload?” Don’t get me wrong, an optimized data platform can translate into millions of annual savings. It’s important. But if you are hiring extra engineers to manage S3 configurations, or you need to launch a multi-month project every quarter to onboard new aspects of the business onto your data platform, that has a high cost as well. You see this same decision-making paradigm play out with open-source solutions. The upfront cost is negligible, but the time cost of maintaining the infrastructure can be substantial. Solution costs and engineering salary costs are not the same, and this false equivalency can create issues down the road. There are two reasons why: Assuming your usage remains static (an important caveat), your solution costs generally stay the same while efficiency goes up. That’s because SaaS vendors are constantly shipping new features. On the other hand, the efficiency of a more manual implementation will decline over time as key players leave and new team members need to be onboarded. When you spend most of your time maintaining infrastructure, your data team starts to lose the plot. The goal slowly drifts from maximizing business value to maintaining infrastructure at peak performance. More meetings become about infrastructure. Niche infrastructure skills take on an outsized importance, and these specialists become more prominent within the organization.
Organizational culture matters and it is often shaped by the primary tasks and problems the team is solving. This second point was a particular emphasis for Michael Sheldon, head of data at Swimply. “Because we had this mandate as a data team to support the entire company, we needed a data stack that could solve two central issues,” Michael said. “One, to centralize all of the data from all of the different parts of the company in one stable place that everyone could use and refer to as a source of truth. And two, to enable us to have enough time to really focus on the insights and not just the data infrastructure itself.” Of course, there will be times when your business use case calls for premium performance. A credit card fraud data product with high latency is just a waste of time. A customer-facing app that gives users the spinning wheel of death will be unacceptable, likely requiring you to deploy a high-performing query engine. However in most cases, your data warehouse or managed data lakehouse will scale just fine. Double-check any requirements that say otherwise. Takeaway While ease of use and performance are interrelated variables that must be balanced, most data leaders will want to have a bias towards ease of use due to relatively hidden maintenance and culture costs. Your competitive advantage is more frequently found in enriching and applying first-party data rather than maintaining complex infrastructure. In Defense of the MDS I know it’s fashionable to bash the modern data stack (and you might not need it), but for all its faults, it will be the best choice for the majority of data teams. It’s an ideal blend of quick value generation and future proofing a long-term investment. Many of the emerging technologies have significant value albeit with more narrow use cases. It will be exciting to see how they evolve and shape the practice of data engineering. However, while compute and storage need to operate and scale separately, having those services and corresponding metadata within the same platform is just too powerful with too many advantages to ignore.
Several tools/scripts are included in the bin directory of the Apache Kafka binary installation. Even though that directory contains a number of scripts, in this article I want to highlight the five scripts/tools that I believe will have the biggest influence on your development work, mostly related to real-time data stream processing. After setting up the development environment, followed by installation and configuration of either a single-node or multi-node Kafka cluster, the first built-in script or tool is kafka-topics.sh.

Kafka-topics.sh

Using kafka-topics.sh, we can create a topic on the cluster to publish and consume some test messages/data. Subsequently, we can alter the created topic, delete a topic, change the replication factor or the number of partitions for a topic, list the topics created on the cluster, and so on.

Kafka-console-producer.sh

We may directly send records to a topic from the command line using this console producer script. Typically, this script is an excellent way to quickly test whether a dev or staging Kafka single/multi-node setup is working or not. It is also very helpful for quickly testing new consumer applications when we aren’t yet producing records to the topics from integrated sender apps or IoT devices.

```shell
kafka-console-producer --topic <topic name> \
  --broker-list <broker-host:port>
```

Even though the above method of using the command-line producer just sends values rather than any keys, using the key.separator we can send full key/value pairs too.

```shell
kafka-console-producer --topic <topic name> \
  --broker-list <broker-host:port> \
  --property "parse.key=true" \
  --property "key.separator=:"
```

Kafka-console-consumer.sh

Now let’s look at record consumption from the command line, which is the opposite side of the coin. We may directly consume records from a Kafka topic using the console consumer’s command-line interface. Quickly starting a consumer can be a very useful tool for experimenting or troubleshooting. Run the command below to rapidly verify that our producer application is delivering messages to the topic. To see all the records from the start, we can add a --from-beginning flag to the command, and we’ll see all records produced to that topic.

```shell
kafka-console-consumer --topic <topic name> \
  --bootstrap-server <broker-host:port> \
  --from-beginning
```

The plain consumers work with records of primitive Java types: String, Long, Double, Integer, etc. The default format expected for keys and values by the plain console consumer is the String type. If the keys or values are not strings, we have to pass the fully qualified class names of the appropriate deserializers using the command-line flags --key-deserializer and --value-deserializer. By default, the console consumer only prints the value component of the messages to the screen. If we want to see the keys as well, we need to append the following flags:

```shell
kafka-console-consumer --topic <topic name> \
  --bootstrap-server <broker-host:port> \
  --property "print.key=true" \
  --property "key.separator=:" \
  --from-beginning
```

Kafka-dump-log.sh

The kafka-dump-log command is our buddy whether we just want to learn more about Kafka’s inner workings or we need to troubleshoot a problem and validate the content. By leveraging kafka-dump-log.sh, we can manually inspect the underlying logs of a topic.
kafka-dump-log \
--print-data-log \
--files ./usr/local/kafka-logs/FirstTopic-0/00000000000000000000.log
For a full list of options and a description of what each option does, run kafka-dump-log with the --help flag. The --print-data-log flag specifies that the data in the log should be printed. The --files flag is required and can also take a comma-separated list of files. Each message’s key, payload (value), offset, and timestamp are easily visible in the dump log, so a lot of information is available there and can be extracted for troubleshooting and similar tasks. Note that in this example, the keys and values for the topic FirstTopic are strings. To run the dump-log tool with key or value types other than strings, we’ll need to use either the --key-decoder-class or the --value-decoder-class flag. Kafka-delete-records.sh As we know, Kafka stores records for topics on disk and retains that data even once consumers have read it. Instead of being kept in a single large file, records are divided up into partition-specific segments where the offset order is continuous across segments for the same topic partition. As servers do not have infinite amounts of storage, we can use Kafka’s configuration settings to control how much data is retained based on time and size. The time-based retention is configured in the server.properties file via log.retention.hours, which defaults to 168 hours (one week). Another property, log.retention.bytes, controls how large a partition’s log can grow before old segments become eligible for deletion; by default, log.retention.bytes is commented out in the server.properties file. The maximum size of a single log segment file is defined by log.segment.bytes; the default value is 1073741824 bytes, and when this size is reached, a new log segment is created. Typically, we never want to go into the filesystem and manually delete files. In order to save space, we would prefer a controlled and supported method of deleting records from a topic. Using kafka-delete-records.sh, we can delete data as desired. To execute kafka-delete-records.sh on the terminal, two mandatory parameters are required: --bootstrap-server, the broker(s) to connect to for bootstrapping, and --offset-json-file, a JSON file containing the deletion settings. The format of this JSON file is simple (a reconstructed sample follows below): it’s an array of JSON objects, where each JSON object has three properties: Topic: the topic to delete from; Partition: the partition to delete from; Offset: the offset we want the delete to start from, moving backward to lower offsets. For this example, I’m reusing the same topic, FirstTopic, from the dump-log tool, so it’s a very simple JSON file; if we had more partitions or topics, we would simply expand on the JSON config file. Because the example topic FirstTopic only has 17 records, we could easily determine the offset at which to begin the deletion process. But in actual use, we’ll probably be unsure what offset to use. If we provide -1, then the offset of the high watermark is used, which means we will be deleting all the data currently in the topic. The high watermark is the highest available offset for consumption.
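The original article showed the JSON file only as a screenshot that is not reproduced here; based on the three properties just described, a reconstructed sketch of an offsets.json for the FirstTopic example could look like this:
{
  "partitions": [
    { "topic": "FirstTopic", "partition": 0, "offset": -1 }
  ],
  "version": 1
}
With the offset set to -1, everything currently in partition 0 of FirstTopic becomes eligible for deletion, which matches the scenario described above.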
After executing the following command on the terminal, we can see the results:
kafka-delete-records --bootstrap-server <broker-host:port> \
--offset-json-file offsets.json
The output of the command shows that Kafka deleted all records from the topic partition FirstTopic-0. The low_watermark value of 17 indicates the lowest offset now available to consumers, because there were only 17 records in the FirstTopic topic. In a nutshell, this script is helpful when we want to flush a Kafka topic that has received bad data and reset the consumers without stopping and restarting the Kafka broker cluster. It allows us to delete all the records from the beginning of a partition up to the specified offset. Hope you have enjoyed this read. Please like and share if you feel this composition is valuable.
Business process automation with a workflow engine or BPM suite has existed for decades. However, using the data streaming platform Apache Kafka as the backbone of a workflow engine provides better scalability, higher availability, and a simplified architecture. This blog post explores the concepts behind using Kafka for persistence with stateful data processing and when to use it instead of or together with other tools. Case studies across industries show how enterprises like Salesforce or Swisscom implement stateful workflow automation and orchestration with Kafka. What Is a Workflow Engine for Process Orchestration? A workflow engine is a software application that orchestrates human and automated activities. It creates, manages, and monitors the state of activities in a business process, such as the processing and approval of a claim in an insurance case, and determines the next activity to transition to according to the defined business process. Sometimes, such software is called an orchestration engine instead. Workflow engines come in many flavors. Some people mainly focus on BPM suites. Others see visual coding integration tools like Dell Boomi as a workflow and orchestration engine. For data streaming with Apache Kafka, the Confluent Stream Designer enables building pipelines with a low-code/no-code approach. And Business Process Management (BPM)? A workflow engine is a core business process management (BPM) component. BPM is a broad topic with many definitions. From a technical perspective, when people talk about workflow engines or orchestration and automation of human and automated tasks, they usually mean BPM tools. The Association of Business Process Management Professionals defines BPM as: "Business process management (BPM) is a disciplined approach to identify, design, execute, document, measure, monitor, and control both automated and non-automated business processes to achieve consistent, targeted results aligned with an organization’s strategic goals. BPM involves the deliberate, collaborative, and increasingly technology-aided definition, improvement, innovation, and management of end-to-end business processes that drive business results, create value, and enable an organization to meet its business objectives with more agility. BPM enables an enterprise to align its business processes to its business strategy, leading to effective overall company performance through improvements of specific work activities either within a specific department, across the enterprise, or between organizations." Tools for Workflow Automation: BPMS, ETL, and iPaaS A BPM suite (BPMS) is a technological suite of tools designed to help BPM professionals accomplish their goals. I gave a talk ~10 years ago at ECSA 2014 in Vienna: "The Next-Generation BPM for a Big Data World: Intelligent Business Process Management Suites (iBPMS)." Hence, you see how long this discussion has existed already. I worked for TIBCO and used products like ActiveMatrix BPM and TIBCO BusinessWorks. Today, most people use open-source engines like Camunda or SaaS from cloud service providers. Most proprietary legacy BPMS I saw ten years ago do not have much presence in enterprises anymore. And many relevant vendors today don't use the term BPM anymore for tactical reasons :-) Some BPM vendors like Camunda also moved forward with fully managed cloud services or new (more cloud-native and scalable) engines like Zeebe.
As a side note: I wish Camunda had built Zeebe on top of Kafka instead of building its own engine with similar characteristics. They have to invest so much into scalability and reliability instead of focusing on a great workflow automation tool. I am not sure this pays off. Traditional ETL and data integration tools fit into this category, too. Their core function is to automate processes (from a different angle). Cloud-native platforms like MuleSoft or Dell Boomi are called Integration Platform as a Service (iPaaS). I explored the differences between Kafka and iPaaS in a separate article. This Is Not Robotic Process Automation (RPA)! Before I look at the relationship between data streaming with Kafka and BPM workflow engines, it is essential to separate another group of automation tools: robotic process automation (RPA). RPA is another form of business process automation that relies on software robots. This automation technology is often seen as artificial intelligence (AI), even though it is usually a much simpler automation of human tasks. In traditional workflow automation tools, a software developer produces a list of actions to automate a task and interfaces to the back-end system using internal APIs or a dedicated scripting language. In contrast, RPA systems develop the action list by watching the user perform that task in the application's graphical user interface (GUI) and then perform the automation by repeating those tasks directly in the GUI. This can lower the barrier to using automation in products that might not otherwise feature APIs for this purpose. RPA is excellent for legacy workflow automation that requires repeating human actions with a GUI. This is a different business problem. Of course, overlaps exist. For instance, Gartner's Magic Quadrant for RPA does not just include RPA vendors like UiPath but also traditional BPM or integration vendors like Pegasystems or MuleSoft (Salesforce) that move into this business. RPA tools integrate well with Kafka. Beyond that, they are out of the scope of this article, as they solve a different problem than Kafka or BPM workflow engines. Data Streaming With Kafka and BPM Workflow Engines: Friends or Enemies? A workflow engine, respectively a BPM suite, has different goals and sweet spots compared to data streaming technologies like Apache Kafka. Hence, these technologies are complementary. It is no surprise that most BPM tools have added support for Apache Kafka (the de facto standard for data streaming) instead of just supporting request-response integration with web service interfaces like HTTP and SOAP/WSDL, just as almost every ETL tool and Enterprise Service Bus (ESB) has Kafka connectors today. This article explores specific case studies that leverage Kafka for stateful business process orchestration. I no longer see BPM and workflow engines kicking off many new projects. Many automation tasks are done with data streaming instead of adding another engine to the stack "just for business process automation." Still, while the techniques overlap, they are complementary. Hence, I would call them frenemies. When To Use Apache Kafka Instead of BPM Suites/Workflow Engines? To be very clear from the start: Apache Kafka cannot and does not replace BPM engines! Many nuances must be evaluated to choose the right tool or combination. And I still see customers using BPM tools and integrating them with Kafka. Camunda, similar to Confluent, did a great job bringing its open-source core into an enterprise solution and, finally, a fully managed cloud service.
Hence, this is the number one BPM engine I see in the field; but it is not like I see it every week in customer meetings. Kafka as the Stateful Data Hub and BPM for Human Interaction So, from my 10+ years of experience with integration and BPM engines, here is my guide for choosing the right tool for the job: Data streaming with Kafka became the scalable real-time data hub in most enterprises to process data and decouple applications. So, this is often the starting point for looking at the automation of a new business process, whether it involves small or high data volumes, and no matter whether the data is transactional or analytical. If your business process is complex and cannot easily be defined or modeled without BPMN, go for it. BPM engines provide visual coding and deployment/monitoring of the automated process. But don't model every single business process in the enterprise. In a very agile world with short time-to-market cycles and constantly changing business strategies, it does not help if too much time is spent on (outdated) diagrams. That's not the goal of BPM, of course. But I have seen too many old, inaccurate BPMN or UML visualizations in customer meetings. If you don't have Kafka yet and need just some automation of a few processes, you might start with a BPM workflow engine. However, both data streaming and BPM are strategic platforms. A simple tool like IFTTT ("If This Then That") might do the job well for automating a few workflows with human interaction in a single domain. The sweet spot of a process orchestration tool is a business process that requires both automated and manual tasks. The BPM workflow engine orchestrates complex human interaction in conjunction with automated processes (whether streaming/messaging interfaces, API calls, or file processing). Process orchestration tools support developers with convenient features for generating UIs for human forms in mobile apps or web browsers. Data Streaming as the Stateful Workflow Engine or With Another BPM Tool? TL;DR: With Kafka as the data hub, you already have access to all the data on other platforms. Evaluate per use case and project whether tools like Kafka Streams can implement and automate the business process or whether a dedicated workflow engine is better. Both integrate very well. Often, a combination of both is the right choice. Saying Kafka is too complex does not count anymore (in the cloud), as it is just one click away via fully managed, supported, pay-as-you-go SaaS offerings like Confluent Cloud. In this blog, I want to share a few stories where stateful workflows were automated natively with data streaming using Kafka and its ecosystem, like Kafka Connect and Kafka Streams. To learn more about BPM case studies, look at Camunda and similar vendors. They provide excellent success case studies, and some use cases combine the workflow engine with Kafka. Apache Kafka as Workflow and Orchestration Engine Apache Kafka focuses on data streaming, i.e., real-time data integration and stream processing. However, the core functionality of Kafka includes database characteristics like persistence, high availability, guaranteed ordering, transaction APIs, and (limited) query capabilities. Kafka is some kind of database but does not try to replace others like Oracle, MongoDB, Elasticsearch, et al. Apache Kafka as Stateful and Reliable Persistence Layer Some of Kafka's functionalities enable implementing the automation of workflows. This includes stateful processing: The distributed commit log provides persistence, reliability, and guaranteed ordering.
Tiered Storage (only available via some vendors or cloud services today) makes long-term storage of big data cost-efficient and scalable. The replayability of historical data is crucial to rewind events in case of a failure or other issues within the business process. Timestamps of events, in combination with a clever design of the Kafka topics and related partitions, allow for a separation of concerns. While a Kafka topic is the source of truth for the entire history with guaranteed ordering, a compacted topic can be used for quick lookups of the current, updated state. This combination enables storing information forever and updating existing information with the latest state, plus querying the data via key/value requests. Stream processing with Kafka Streams or ksqlDB allows keeping the current state of a business process in the application, even long-term, over months or years. Interactive queries on top of the streaming applications allow API queries against the current state in microservices (as an application-centric alternative to using compacted topics); see the Kafka Streams sketch at the end of this section. Kafka Connect, with connectors to cloud-native SaaS and legacy systems, clients for many programming languages (including Java, Python, C++, and Go), and other API interfaces like the REST Proxy, integrates with any data source or data sink required for business process automation. The Schema Registry ensures the enforcement of API contracts (= schemas) in all event producers and consumers. Some Kafka distributions and cloud services add data governance, policy enforcement, and advanced security features to control data flow and privacy/compliance concerns. The Saga Design Pattern for Stateful Orchestration With Kafka Apache Kafka supports exactly-once semantics and transactions. Transactions are often required for automating business processes. However, a scalable cloud-native infrastructure like Kafka does not use XA transactions with the two-phase commit protocol (like you might know from Oracle and IBM MQ) when separate transactions in different databases need to be executed. This does not scale well and is therefore not an option. Learn more about transactions in Kafka (including the trade-offs) in a dedicated blog post: "Apache Kafka for Big Data Analytics AND Transactions." Sometimes, another alternative is needed for workflow automation with transaction requirements. Welcome to an emerging design pattern that originated with microservice architectures: The saga design pattern is a crucial concept for implementing stateful orchestration if one transaction in the business process spans multiple applications or service calls. The saga pattern enables an application to maintain data consistency across multiple services without using distributed transactions. Instead, the saga pattern uses a controller that starts one or more "transactions" (= regular service calls without transaction guarantees) and only continues after all expected service calls return the expected response. In the example from microservices.io: (1) the Order Service receives the POST /orders request and creates the Create Order saga orchestrator; (2) the saga orchestrator creates an Order in the PENDING state; (3) it then sends a Reserve Credit command to the Customer Service; (4) the Customer Service attempts to reserve credit; (5) it then sends back a reply message indicating the outcome; (6) the saga orchestrator either approves or rejects the Order. Find more details in microservices.io's detailed explanation of the saga pattern.
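Before moving on to the case studies, here is a minimal Kafka Streams sketch of two of the capabilities listed above: keeping the latest state per workflow instance in a materialized table (the compacted-topic idea) and exposing it via interactive queries. The topic name order-events, the store name workflow-state-store, and the use of plain String keys and values are illustrative assumptions, not taken from any of the case studies.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class WorkflowStateApp {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "workflow-state-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // order-events holds state-transition events keyed by workflow instance ID,
        // e.g. "order-42" -> "PENDING", later "order-42" -> "APPROVED".
        // Reading the topic as a KTable materializes a local store that always
        // holds the latest state per key (the same idea as a compacted topic).
        KTable<String, String> workflowState = builder.table(
                "order-events",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("workflow-state-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();

        // Wait until the instance is RUNNING; the store is not queryable before that.
        while (streams.state() != KafkaStreams.State.RUNNING) {
            Thread.sleep(100);
        }

        // Interactive query: look up the current state of one workflow instance
        // directly from the local state store instead of scanning the topic.
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType(
                        "workflow-state-store",
                        QueryableStoreTypes.<String, String>keyValueStore()));
        System.out.println("Current state of order-42: " + store.get("order-42"));
    }
}
A saga orchestrator as described above could be built on the same foundation: each reply event updates the workflow instance's state in the table, and the next command is only emitted once the expected state transition has been observed.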
Case Studies for Workflow Automation and Process Orchestration With Apache Kafka Reading the above sections, you should better understand Kafka's persistence capabilities and the saga design pattern. Many business processes can be automated at scale in real time with the Kafka ecosystem. There is no need for another workflow engine or BPM suite. Learn from interesting real-world examples across different industries: digital native: Salesforce; financial services: Swisscom; insurance: Mobiliar; open-source tool: Kestra. Salesforce: Web-Scale Kafka Workflow Engine Salesforce built a scalable real-time workflow engine powered by Kafka. Why? The company's recommendation is to "use Kafka to make workflow engines more reliable." At Current 2022, Salesforce presented their project for workflow and process automation with a stateful Kafka engine. I highly recommend checking out the entire talk or at least the slide deck. Salesforce introduced a workflow engine concept that only uses Kafka to persist state transitions and execution results. The system banks on Kafka’s high reliability, transactionality, and high scale to keep setup and operating costs low. The first target use case was Continuous Integration (CI) systems (source: Salesforce). The demo of the system presented: parallel and nested CI workflow definitions of varying declaration formats; real-time visualization of progress with the help of Kafka; chaos and load generation to showcase how retries and scale-ups work; extension points; and a comparison of the implementation with other popular workflow engines. Here is an example of the workflow architecture (source: Salesforce). Compacted Topics as the Backbone of the Stateful Workflow Engine TL;DR of Salesforce's presentation: Kafka is a viable choice as the persistence layer for problems where you have to do state machine processing. A few notes on the advantages Salesforce pointed out for using Kafka instead of other databases or CI/CD tools: the only stateful component is Kafka -> a dominating reliability setup; Kafka was chosen as the persistence layer -> SLAs / reliability better than with a database / sharded Jenkins / NoSQL -> four nines (+ horizontal scaling) instead of three nines (see the slide "reliability contrast"); K8s clusters and CI workflows can easily be restarted; Kafka state topic -> store the full workflow graph in each message (workflow Protobuf with defined schemas); compacted topic updates states: not started, running, complete -> track, manage, and count state machine transitions. The Salesforce implementation is open source and available on GitHub: "Junction Workflow is an upcoming workflow engine that uses compacted Kafka topics to manage state transitions of workflows that users pass into it. It is designed to take multiple workflow definition formats." Swisscom Custodigit: Secure Crypto Investments With Stateful Data Streaming and Orchestration Custodigit is a modern banking platform for digital assets and cryptocurrencies. It provides crucial features and guarantees for seriously regulated crypto investments: secure storage of wallets, sending and receiving on the blockchain, trading via brokers and exchanges, and a regulated environment (a key aspect, and no surprise, as this product comes from Switzerland, a very regulated market). The following architecture diagrams are only available in German, unfortunately, but I think you get the point. With the saga pattern, Custodigit leverages Kafka Streams for stateful orchestration.
The Custodigit architecture uses microservices to integrate with financial brokers, stock markets, cryptocurrency blockchains like Bitcoin, and crypto exchanges (source: Swisscom). Custodigit implements the saga pattern for stateful orchestration. Stateless business logic is truly decoupled, while the saga orchestrator keeps the state for the choreography between the other services (source: Swisscom). Swiss Mobiliar: Decoupling and Workflow Orchestration Swiss Mobiliar (Schweizerische Mobiliar, aka Die Mobiliar) is the oldest private insurer in Switzerland. Event streaming powered by Apache Kafka supports various use cases at Swiss Mobiliar: an orchestrator application to track the state of a billing process, Kafka as a database and Kafka Streams for data processing, complex stateful aggregations across contracts and re-calculations, and continuous monitoring in real time. Mobiliar's architecture shows the decoupling of applications and the orchestration of events (source: Die Mobiliar). Here you can see the data structures and states defined in API contracts and enforced via the Schema Registry (source: Die Mobiliar). Also, check out the on-demand webinar with Mobiliar and Spoud to learn more about their Kafka usage. Kestra: Open-Source Kafka Workflow Engine and Scheduling Platform Kestra is an infinitely scalable orchestration and scheduling platform for creating, running, scheduling, and monitoring millions of complex pipelines. The project is open source and available on GitHub: Any kind of workflow: workflows can start simple and progress to more complex systems with branching, parallel and dynamic tasks, and flow dependencies. Easy to learn: flows are defined in a simple, descriptive YAML language; you don't need to be a developer to create a new flow. Easy to extend: plugins are everywhere in Kestra; many are available from the Kestra core team, but you can create one easily. Any triggers: Kestra is event-based at heart; you can trigger an execution from an API call, a schedule, a detection, or an event. A rich user interface: the built-in web interface allows you to create, run, and monitor all your flows; no need to deploy your flows, just edit them. Enjoy infinite scalability: Kestra is built around top cloud-native technologies; scale to millions of executions stress-free. This is an excellent project with a cool drag-and-drop UI (I copied the description from the project). Please check out the open-source project for more details. What About Apache Flink for Streaming Workflow Orchestration and BPM? This post covered various case studies for the Kafka ecosystem as a stateful workflow orchestration engine for business process automation. Apache Flink is a stream processing framework that complements Kafka and sees significant adoption. The above case studies used Kafka Streams as the stateful stream processing engine. It is a brilliant choice if you want to embed the workflow logic into its own application/microservice/container. Apache Flink has other sweet spots: powerful stateful analytics, unified batch and real-time processing, ANSI SQL support, and more. In a detailed blog post, I explored the differences between Kafka Streams and Apache Flink for stream processing.
Apache Flink for Complex Event Processing (CEP) and Cross-Cluster Queries The following differentiating features make Flink an excellent choice for some workflows: Complex Event Processing (CEP): CEP generates new events to trigger actions based on situations it detects across multiple event streams with events of different types (situations that build up over time and space). Event Stream Processing (ESP) detects patterns over event streams with homogenous events (i.e., patterns over time). The powerful pattern API of FlinkCEP allows you to define complex pattern sequences you want to extract from your input stream to detect potential matches. "Read once, write many": Flink allows different analytics without reading the Kafka topic several times. For instance, two queries on the same Kafka topic like "CTAS .. * FROM mytopic WHERE eventType=1" and "CTAS .. * FROM mytopic WHERE eventType=2" can be grouped together. The query plan will only do one read. This fundamentally differs from Kafka-native stream processing engines like Kafka Streams or ksqlDB. Cross-cluster queries: Data processing across different Kafka topics of independent Kafka clusters of different business units is a new kind of opportunity to optimize and automate an existing end-to-end business process. Be careful to use this feature wisely, though. It can become an anti-pattern in the enterprise architecture and create complex and unmanageable “spaghetti integrations.” Should you use Flink as a workflow automation engine? It depends. Flink is great for stateful calculations and queries. The domain-driven design enabled by data streaming allows choosing the right technology for each business problem. Evaluate Kafka Streams, Flink, BPMS, and RPA. Combine them as it makes the most sense from a business perspective, cost, time to market, and enterprise architecture. A modern data mesh abstracts the technology behind each data product. Kafka at the heart decouples the microservices and provides data sharing in real time. Kafka as Workflow Engine or With Other Orchestration Tools BPMN or similar flowcharts are great for modeling business processes. Business and technical teams easily understand the visualization. It documents the business process for later changes and refactoring. Various workflow engines solve the automation: BPMS, RPA tools, ETL and iPaaS data integration platforms, or data streaming. This blog post explored several case studies where Kafka is used as a scalable and reliable workflow engine to automate business processes. This is not always the best option. Human interaction, long-running processes, or complex workflow logic are signs that a dedicated tool may be the better choice. Make sure you understand underlying design patterns like saga, evaluate dedicated orchestration and BPM engines like Camunda, and choose the right tool for the job. Combining data streaming and a separate workflow engine is sometimes best. However, the impressive example of Salesforce proved that Kafka can and should be used as a scalable, reliable workflow and stateful orchestration engine for the right use cases. Fortunately, modern enterprise architecture concepts like microservices, data mesh, and domain-driven design allow you to choose the right tool for each problem. For instance, in some projects, Apache Flink might be a terrific add-on for challenges that require Complex Event Processing to automate the stateful workflow. How do you implement business process automation? What tools do you use with or without Kafka?
Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.
In this article, we explore the comparison between JMS (Java Message Service) and Kafka, two widely used messaging systems in modern applications. While both serve the purpose of enabling efficient communication between applications, they differ significantly in their design, features, and use cases. In today's distributed architectures, where applications and services are spread across multiple nodes, messaging systems serve as the backbone for inter-application communication. They enable asynchronous and decoupled communication, allowing different components to exchange data reliably without being tightly coupled. By utilizing messaging systems, developers can build scalable, fault-tolerant, and loosely coupled architectures that are flexible and resilient to failures. Among the leading messaging systems today, JMS and Kafka take center stage. Choosing the right messaging system depends on the specific requirements and characteristics of your architecture. JMS excels in traditional enterprise scenarios, providing reliable messaging and seamless integration capabilities. On the other hand, Kafka shines in high-throughput and real-time data streaming use cases, offering fault tolerance and horizontal scalability. In this article, we will delve into a comprehensive comparison of JMS and Kafka, exploring their strengths, weaknesses, and when to choose one over the other. Understanding JMS Java Message Service (JMS) is a widely adopted messaging API in the Java ecosystem. It provides a standard way of creating, sending, and receiving messages between distributed systems. With its strong emphasis on reliability, JMS ensures that messages are delivered and processed in a guaranteed manner. It offers durable queues and topics, enabling messages to be stored persistently and consumed at the recipient's convenience. JMS is also renowned for its seamless integration capabilities with various enterprise technologies, making it a preferred choice for traditional enterprise applications. Furthermore, JMS offers support for different message exchange patterns, including synchronous and asynchronous communication. In synchronous communication, the sender waits for a response from the receiver before continuing, while in asynchronous communication, the sender continues its execution without blocking. This flexibility allows developers to choose the appropriate communication pattern based on the specific requirements of their applications. JMS provides support for two fundamental messaging models: publish-subscribe and point-to-point. Understanding these models helps in choosing the most suitable approach for a given messaging scenario. Publish-subscribe model In the publish-subscribe model, messages are sent to a topic, and multiple subscribers can receive those messages concurrently. This model is based on the concept of message broadcasting, where publishers publish messages without any knowledge of the subscribers.
Subscribers express their interest in specific topics by subscribing to them. When a message is published on a topic, it is delivered to all subscribers who have expressed interest in that topic. The publish-subscribe model is well-suited for scenarios where messages need to be disseminated to multiple recipients. For example, in a financial trading application, stock price updates can be published on a specific topic, and multiple subscribers, such as traders or analytics systems, can receive and process those updates simultaneously. Point-to-point model In the point-to-point model, messages are sent to specific queues, and each message is consumed by a single receiver. Unlike the publish-subscribe model, where messages are broadcasted to multiple subscribers, point-to-point messaging ensures that each message is delivered to only one consumer. Messages are stored in the queue until they are consumed by the intended receiver. The point-to-point model is suitable for scenarios where messages need to be processed by a single recipient, such as task distribution or request-response communication. For instance, in an order processing system, when a new order is received, it can be placed in a queue, and a dedicated consumer retrieves and processes the order from the queue, ensuring that each order is handled by only one consumer. JMS is known for its robustness and reliability, making it a popular choice for enterprise applications. It ensures the reliable delivery of messages through features like message acknowledgment and persistent storage. When a message is consumed, the consumer sends an acknowledgment back to the messaging system, indicating that the message has been successfully processed. If an acknowledgment is not received, the messaging system can redeliver the message or take appropriate action based on the configured acknowledgment settings. Real-World Examples of JMS Usage and Its Benefits JMS is extensively used in various real-world scenarios, showcasing its versatility and benefits. Some examples include: Order processing systems: JMS can be employed to handle incoming orders in an e-commerce system. Orders can be placed in a queue, and multiple consumers can process them concurrently. This ensures efficient order processing and scalability while maintaining the reliability of message delivery. Financial services: In the financial industry, JMS is utilized for real-time market data dissemination. Stock price updates, market news, and trade notifications can be published on specific topics, allowing subscribers to receive and process the information in real-time. This enables traders, analysts, and other financial applications to stay up-to-date with market events. Event-driven architectures: JMS is often used in event-driven architectures, where different components communicate through events. Events can be published on topics, and subscribers can react to specific events of interest. This decoupled communication model enables loose coupling between components and enhances the scalability and flexibility of the overall architecture. Enterprise messaging: JMS shines in enterprise messaging scenarios, where reliable and asynchronous communication is essential. It is widely used in systems involving order processing, inventory management, and customer support, where reliable message delivery is crucial for business operations. Transactional systems: JMS's support for distributed transactions makes it a good fit for transactional systems. 
It ensures that messages are delivered reliably and consistently, enabling systems to maintain data integrity and process transactions in a robust and reliable manner. Legacy system integration: JMS provides a standardized integration approach, making it suitable for integrating legacy systems with modern applications. By leveraging JMS, organizations can bridge the gap between new and legacy systems, enabling seamless communication and data exchange between different components. Exploring Kafka Kafka, initially developed at LinkedIn, has emerged as a prominent distributed streaming platform in the realm of messaging systems. It was designed to handle high-throughput, fault-tolerant, and real-time data streaming. Kafka's architecture and unique characteristics set it apart from traditional messaging systems. At its core, Kafka operates as a distributed commit log. It uses a cluster of servers called brokers, where messages are stored in an append-only log structure. This distributed architecture allows Kafka to handle massive data streams while ensuring fault tolerance and scalability. Messages in Kafka are organized into topics, and producers publish messages to these topics. Consumers can then subscribe to the topics and receive messages in a highly efficient and parallelized manner. One of Kafka's key strengths lies in its ability to handle high-throughput data streams. By leveraging its distributed architecture, Kafka can partition topics into multiple partitions, allowing for parallel processing of messages across multiple consumers. This horizontal scalability enables Kafka to handle large volumes of data and sustain high message rates without sacrificing performance. Furthermore, Kafka ensures fault tolerance through its replication mechanism. Each partition in Kafka has multiple replicas, spread across different brokers. If a broker fails, one of the replicas automatically takes over, ensuring uninterrupted data streaming and minimizing the risk of data loss. This fault-tolerant design makes Kafka a reliable choice for mission-critical applications where data integrity and continuity are paramount. Real-World Examples of Kafka Usage Kafka's strengths make it well-suited for a range of use cases, particularly those that require real-time analytics and event-driven architectures: Real-time analytics: Kafka enables organizations to process and analyze streaming data in real-time. By ingesting and storing large volumes of data in real-time, Kafka acts as a reliable and scalable data pipeline for analytics platforms. This allows businesses to gain actionable insights, perform fraud detection, monitor user behavior, and generate timely reports based on up-to-date information. Event-driven architectures: Kafka's publish-subscribe model makes it an excellent choice for event-driven architectures. Events generated by different systems and components can be published to Kafka topics, and interested parties can subscribe to these topics to consume and react to the events. This decoupled communication pattern fosters loose coupling, scalability, and flexibility in building event-driven systems, such as microservices architectures or IoT (Internet of Things) platforms. Log aggregation and stream processing: Kafka's log-based storage architecture makes it ideal for log aggregation and stream processing. By collecting logs from various systems and applications, Kafka consolidates them into a central repository, enabling easy search, analysis, and debugging. 
Additionally, Kafka integrates seamlessly with stream processing frameworks like Apache Spark and Apache Flink, enabling real-time stream processing, transformations, and complex event processing on the data streams. Also, as a valuable resource for gaining deeper insights into Kafka, we recommend listening to a podcast featuring Robin Moffatt, a Senior Developer Advocate at Confluent, the company founded by the original creators of Apache Kafka. In this podcast, Robin Moffatt shares his expertise on how Kafka works and provides valuable insights into leveraging its capabilities for building robust streaming architectures. You can access the podcast episode at the following link: Explaining how Kafka works with Robin Moffatt. Comparing Architecture of JMS and Kafka JMS typically follows a broker-based architecture, where messages are exchanged through a message broker or middleware. The message broker acts as an intermediary between the sender and the receiver, facilitating reliable and guaranteed message delivery. In a JMS broker-based architecture, producers send messages to a destination (queue or topic) hosted by the message broker. The broker stores the messages until they are consumed by the intended consumers. When a consumer retrieves a message, it is removed from the destination. The broker ensures reliable message delivery through various mechanisms. For instance, in point-to-point messaging, messages are stored in a queue, and each message is consumed by a single consumer. The broker guarantees that each message is delivered to only one consumer, ensuring a reliable one-to-one message exchange. In publish-subscribe messaging, messages are sent to a topic, and multiple subscribers can receive those messages concurrently. The broker ensures that each subscriber receives a copy of the message, enabling one-to-many message exchange. Kafka, in contrast to JMS, employs a distributed commit log architecture. This design allows Kafka to handle high-throughput, fault-tolerant data streaming and enables its unique features. In Kafka's architecture, messages are stored in a distributed, fault-tolerant commit log called a topic. Each topic is split into multiple partitions, where each partition represents an ordered sequence of messages. Each partition is replicated across multiple Kafka brokers, ensuring fault tolerance and data redundancy. Kafka's partitioning mechanism plays a crucial role in its scalability and fault tolerance. Messages within each partition are ordered, ensuring that they are processed in the order they were produced. By partitioning the data and distributing it across multiple brokers, Kafka achieves horizontal scalability. Consumers can read from multiple partitions concurrently, enabling high throughput and efficient processing of large message streams. Difference in Scalability Between JMS and Kafka One of the significant differences between JMS and Kafka lies in their scalability and performance characteristics. JMS implementations often face challenges when dealing with high message volumes, as the broker can become a bottleneck. Adding more brokers to the JMS infrastructure may improve scalability, but it can introduce complexities and additional management overhead. In contrast, Kafka's distributed commit log architecture and partitioning mechanism enable it to handle massive message streams with ease.
Kafka's horizontal scalability allows it to distribute the message processing across multiple brokers, accommodating high throughput and scaling to meet the demands of large-scale data streaming. Difference in Performance Between JMS and Kafka JMS, with its emphasis on reliable and guaranteed message delivery, may introduce some additional latency compared to Kafka. The broker-based architecture of JMS can become a bottleneck when dealing with high message volumes, as the broker needs to handle message routing and management. While adding more brokers can improve scalability, it can also introduce complexities and additional management overhead. Therefore, JMS might not be the ideal choice for scenarios that require ultra-low latency and high throughput. On the other hand, Kafka's distributed commit log architecture and partitioning mechanism make it highly performant. Kafka excels in scenarios where real-time data streaming and processing are crucial. Its horizontal scalability enables it to distribute the message processing across multiple brokers, accommodating high throughput and scaling to meet the demands of large-scale data streaming. With its low latency and high throughput capabilities, Kafka is a popular choice for event-driven architectures, real-time analytics, and other use cases where processing large volumes of data in real time is essential. Analysis of Scalability and Throughput of JMS and Kafka
Challenges of high message volumes. JMS: the broker-based architecture can become a bottleneck as message volume increases; system resources are strained by the routing, storage, and management of messages; potential performance degradation and scalability limitations. Kafka: the distributed commit log architecture and partitioning mechanism enable scalability and high throughput; the workload is distributed across multiple brokers for parallel processing and efficient resource utilization; it can handle large message volumes without becoming a bottleneck.
Scaling for high message volumes. JMS: more brokers are added to distribute the load; complexity and additional management overhead; maintaining message order across multiple brokers can be challenging. Kafka: horizontal scalability through the distributed architecture; independent consumption of partitions allows for high concurrency and throughput; simplified scaling without compromising message order.
Maintaining message order at high throughput. JMS: challenges in maintaining message order across multiple brokers. Kafka: messages within a partition are ordered, ensuring the sequencing of events.
Suitability for event sourcing and log aggregation. JMS: limited suitability for maintaining message order in scenarios like event sourcing or log aggregation. Kafka: well-suited for maintaining message order in event-driven scenarios and log aggregation.
Persistence and Durability of JMS and Kafka JMS (Java Message Service) places a strong emphasis on guaranteed message delivery and persistence; messages are not lost during transmission or processing. This is achieved through several mechanisms: Guaranteed Message Delivery: JMS ensures that messages are reliably delivered to their intended destinations, even in the face of failures or disruptions, through features such as message acknowledgment and persistent storage. Persistence: JMS supports persistent message storage, which enhances durability.
When a message is marked as persistent, it is stored in a persistent message store, such as a database or a file system. This ensures that messages survive system failures and can be retrieved even if the messaging system or application restarts. JMS also provides additional features to support reliability, such as redelivery policies and transaction support. Redelivery policies allow configuring the number of times a message should be retried before being marked as undeliverable. Transaction support ensures atomicity and consistency when sending or receiving messages in a transactional manner. Kafka, on the other hand, utilizes a log-based storage architecture and fault-tolerant data replication to ensure persistence and durability. Log-Based Storage Architecture: In Kafka, messages are stored in logs. Each topic is divided into partitions, and each partition consists of an ordered log of messages. This log-based architecture provides several benefits. Messages are appended to the end of the log, allowing for fast writes. Additionally, since the log is an append-only data structure, it enables efficient storage and minimizes disk I/O. Fault-Tolerant Data Replication: In the event of a failure, Kafka's fault-tolerant design allows for automatic failover. If a broker becomes unavailable, one of the replicas automatically takes over the leadership of the partition, ensuring continuous message availability and preventing data loss. This fault-tolerant architecture guarantees that messages are persisted and replicated for reliable and durable data storage. Fault Recovery: Due to Kafka's distributed and replicated nature, it provides fault recovery capabilities. If a broker fails, Kafka automatically detects the failure and elects a new leader for the affected partitions. This leader transition process ensures that message processing remains uninterrupted and that data availability is maintained even during failures. Data Replayability: Kafka's durable logs enable data replayability. Since messages are stored persistently and retained in the log, it is possible to replay the messages and process them again if needed. This capability is valuable in scenarios such as reprocessing data for analytics, testing, or recovering from system failures. Data replayability ensures that no information is lost and provides flexibility in handling and processing data in different scenarios. JMS emphasizes guaranteed message delivery and persistence, providing features like acknowledgment, persistent storage, redelivery policies, and transaction support. Kafka, on the other hand, leverages its log-based storage architecture and fault-tolerant data replication to ensure durability. Kafka's durability enables fault recovery by electing new leaders during failures and offers data replayability by retaining messages in durable logs, allowing consumption from any point in time.
These libraries provide APIs and tools for interacting with JMS, simplifying the development process and enabling efficient integration with Java applications. JMS also integrates with popular Java frameworks like Spring, enabling seamless integration with Spring-based applications and leveraging the powerful features provided by both JMS and Spring. Kafka's design and popularity have fostered a strong integration ecosystem, particularly in the big data realm. Kafka integrates seamlessly with various big data tools and ecosystems, making it a valuable component in modern data architectures. Kafka integrates well with Apache Hadoop, a widely used big data framework, allowing for seamless data ingestion and processing. It acts as a reliable and efficient data pipeline for streaming data into Hadoop clusters, enabling real-time analytics and processing of data at scale. Kafka's integration with Apache Spark, another popular big data processing framework, enables efficient stream processing and complex event processing on the data streams. Moreover, Kafka integrates with Apache Storm, a distributed real-time computation system, enabling real-time processing of streaming data with low latency. This integration allows for the creation of powerful real-time applications that can process and analyze data as it flows through the Kafka topics. In addition to the extensive integration capabilities offered by JMS and Kafka, organizations can further enhance their messaging and data streaming solutions through an integration platform like Toro Cloud's Martini. Martini acts as a versatile integration layer, seamlessly connecting JMS and Kafka with other systems and technologies, simplifying the integration process and enabling smooth communication across different components. Guidelines for Choosing Between JMS and Kafka Based on Specific Requirements When deciding between JMS and Kafka for a messaging solution, it's important to consider the specific requirements of the use case. Here are some guidelines to help in the decision-making process, followed by a short code sketch that contrasts the two programming models: Message Reliability and Guaranteed Delivery: If guaranteed message delivery and strict reliability are critical, JMS is recommended. Its features, such as message acknowledgment and persistent storage, ensure that messages are reliably delivered. JMS is well-suited for scenarios like transactional systems and enterprise messaging. High Scalability and Real-time Data Streaming: If the use case involves handling high message volumes, real-time data streaming, and the need for horizontal scalability, Kafka is the recommended choice. Its distributed commit log architecture, partitioning mechanism, and fault-tolerant replication enable it to handle massive message streams with ease. Integration Ecosystem and Tooling: Consider the existing technology ecosystem and integration requirements. JMS integrates seamlessly with Java technologies and enterprise systems, making it a natural fit for Java-based applications. Kafka has a strong integration ecosystem, particularly within the big data space, allowing for seamless integration with various big data tools and frameworks. Development Flexibility and Processing Models: If the use case requires flexibility in message consumption, replayability, and the ability to control the pace of message processing, Kafka's pull-based model provides advantages. JMS offers different consumption models like point-to-point and publish-subscribe, suitable for specific messaging patterns.
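To make the contrast between the two programming models concrete, here is a minimal, hedged sketch that sends one message with each API. The broker addresses, the queue/topic name orders, and the choice of ActiveMQ as the JMS provider are assumptions for illustration only; any JMS-compliant provider would look almost identical.
import java.util.Properties;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.DeliveryMode;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SendComparison {

    // JMS point-to-point: the broker stores the message and delivers it to exactly one consumer.
    static void sendWithJms() throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616"); // assumed broker address
        Connection connection = factory.createConnection();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("orders");
        MessageProducer producer = session.createProducer(queue);
        producer.setDeliveryMode(DeliveryMode.PERSISTENT); // message survives broker restarts
        producer.send(session.createTextMessage("order-42:NEW"));
        connection.close(); // closes the session and producer as well
    }

    // Kafka: the record is appended to a partitioned, replicated log and retained
    // according to the topic's retention settings, independent of who consumes it.
    static void sendWithKafka() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for the in-sync replicas
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-42", "NEW"));
        }
    }

    public static void main(String[] args) throws Exception {
        sendWithJms();
        sendWithKafka();
    }
}
The consumer side differs even more: the JMS broker pushes a queue message to a single consumer and removes it once it is acknowledged, whereas Kafka consumers pull from the log at their own offsets, so the same record can be re-read by multiple consumer groups or replayed later.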
By carefully considering the specific requirements and strengths of JMS and Kafka, organizations can make an informed decision and choose the messaging solution that best aligns with their use case, ensuring efficient and reliable communication in their distributed systems. Conclusion The comparison between JMS and Kafka reveals distinct characteristics and strengths that make them suitable for different messaging scenarios. JMS excels in providing guaranteed message delivery, reliability, and seamless integration with Java technologies. It is recommended for use cases such as enterprise messaging, transactional systems, and legacy system integration. On the other hand, Kafka stands out with its scalability, fault tolerance, and real-time data streaming capabilities. It is a powerful choice for scenarios like real-time analytics, event-driven architectures, and log aggregation. Kafka's integration with big data tools further enhances its value in the big data ecosystem. When evaluating messaging needs, it is essential for readers to consider their specific requirements. Factors such as message reliability, scalability, integration ecosystem, and processing models should be carefully assessed. By thoroughly evaluating their own needs, organizations can choose between JMS and Kafka to build efficient, scalable, and reliable messaging solutions that align with their specific use cases. Ultimately, both JMS and Kafka offer robust messaging solutions with their unique strengths and features. Understanding the characteristics of each system and considering the specific requirements will empower organizations to make an informed decision and choose the messaging system that best suits their needs, enabling seamless communication, efficient data processing, and successful implementation of their distributed architectures.
When talking about data processing, people often abbreviate it as “ETL.” However, if we look closely, data processing has undergone several iterations, from ETL, ELT, and XX ETL (such as Reverse ETL and Zero-ETL) to the currently popular EtLT architecture. While the Hadoop era mainly relied on ELT (Extract, Load, Transform) methods, the rise of real-time data warehouses and data lakes has rendered ELT obsolete. EtLT has emerged as the standard architecture for real-time data loading into data lakes and real-time data warehouses. Let’s explore the reasons behind the emergence of these architectures, their strengths and weaknesses, and why EtLT is gradually replacing ETL and ELT as the global mainstream data processing architecture, along with practical open-source methods. ETL Era (1990–2015) In the early days of data warehousing, Bill Inmon, the proponent of data warehousing, defined it as a subject-oriented data storage architecture in which data is categorized and cleaned during storage. During this period, most data sources were structured databases (e.g., MySQL, Oracle, SQL Server, ERP, CRM), and data warehouses predominantly relied on OLTP databases (e.g., DB2, Oracle) for querying and historical storage. Handling complex ETL processes with such databases proved to be challenging. To address this, a plethora of ETL software emerged, such as Informatica, Talend, and Kettle, which greatly facilitated integrating complex data sources and offloaded data warehouse workloads. Advantages: A clear technical architecture, smooth integration of complex data sources, and about 50% of the data warehouse work handled by ETL software. Disadvantages: All processing is done by data engineers, leading to longer fulfillment of business requirements, and hardware costs are high because of the double investment, especially when dealing with large data volumes. During the early and mid stages of data warehousing, when data source complexity was significantly higher, the ETL architecture became the industry standard and remained popular for over two decades. ELT Era (2005–2020) As data volumes grew, both data warehouse and ETL hardware costs escalated. New MPP (Massively Parallel Processing) and distributed technologies emerged, leading to the gradual shift from ETL to the ELT architecture in the later stages of data warehousing and the rise of big data. Teradata, one of the major data warehouse vendors, and Hadoop Hive, a popular Hadoop-based data warehousing solution, adopted the ELT architecture. They focused on directly loading data (without complex transformations such as joins and group-bys) into the data warehouse’s staging layer, followed by further processing using SQL or Hive SQL from the staging layer to the data atomic layer and finally to the summary layer and indicator layer. While Teradata targeted structured data and Hadoop targeted unstructured data, they adopted similar 3–4 layer data storage architectures and methodologies for data warehousing globally. Advantages: Utilizing the data warehouse’s high-performance computing for large data volumes results in higher hardware ROI, and complex business logic can be handled in SQL, reducing the overall cost of data processing personnel by employing data analysts and business-savvy technical personnel without the need to understand ETL engines like Spark and MapReduce. Disadvantages: Only suitable for scenarios with simple but large data volumes; inadequate for complex data sources and lacking support for real-time data warehousing requirements.
ODS (Operational Data Store) was introduced as a transitional solution for the complex data sources that ELT-based warehouses could not load directly and to improve real-time capabilities. Complex sources were processed through real-time CDC (Change Data Capture), real-time APIs, or micro-batch jobs into a separate storage layer before being ELT-ed into the enterprise data warehouse. Many enterprises still use this approach today. Some place the ODS inside the data warehouse and use Spark or MapReduce for the initial ETL step, then perform business data processing within the warehouse itself (using Hive, Teradata, Oracle, DB2, and the like).

At this stage, the early EtLT (Extract, small-t transform, Load, Transform) pattern had already taken shape, characterized by a division of roles: the complex work of data extraction, CDC, data structuring, and standardization — the small "t" — is handled by data engineers, whose objective is to move data from the source systems into the warehouse's data preparation or atomic layer. The processing of those atomic layers into business-facing aggregates and metrics (operations such as Group By and Join) is performed by business data engineers or data analysts who are fluent in SQL. With the rise of the EtLT architecture and ever-growing data volumes, standalone ODS projects have gradually faded from the limelight.

EtLT Era (2020–Present)

The EtLT architecture, as summarized by James Densmore in the Data Pipelines Pocket Reference (2021), is a modern and globally popular data processing framework. It emerged in response to changes in the modern data infrastructure.

Background of the EtLT Architecture

The modern data infrastructure has the following characteristics, which led to the emergence of EtLT:

Cloud, SaaS, and hybrid local data sources
Data lakes and real-time data warehouses
A new generation of big data federation
The proliferation of AI applications
The fragmentation of the enterprise data community

The Appearance of Complex Data Sources

In today's enterprises, cloud and SaaS have made an already complex data source landscape even more intricate. Dealing with SaaS data gave rise to the concept of data ingestion, exemplified by tools such as Fivetran and Airbyte, which aim to solve the ELT problem of loading SaaS data into warehouses like Snowflake. The complexity has grown further with cloud-based data storage services (e.g., AWS Aurora, AWS RDS, managed MongoDB services) coexisting with traditional on-premises databases and software (SAP, Oracle, DB2) in hybrid cloud architectures. Traditional ETL and ELT architectures cannot cope with processing data in such an environment.

Data Lakes and Real-Time Data Warehouses

In modern data architecture, data lakes combine the roles of the traditional ODS and the data warehouse: they enable real-time data processing and allow changes to be applied to data at the source (e.g., Apache Hudi, Apache Iceberg, Databricks Delta Lake).
Simultaneously, the concept of the real-time data warehouse has surfaced, with new compute engines (e.g., Apache Pinot, ClickHouse, Apache Doris) making real-time ETL a priority. Traditional CDC ETL tools and real-time stream processing platforms, however, struggle to support data lakes and real-time warehouses adequately, whether because of compatibility issues with the new storage engines or limitations in connecting to modern data sources.

The Emergence of a New Generation of Big Data Federation

A new breed of architecture has emerged that aims to minimize data movement across different data stores, running complex queries directly through connectors or via rapid data loading. Examples include Starburst's Trino (formerly PrestoSQL) and Onehouse, built on Apache Hudi. These tools excel at data caching and on-the-fly cross-data-source queries, a paradigm that traditional ETL/ELT tools are poorly equipped to support.

The Rise of Large-Scale Models

With the emergence of ChatGPT in 2022, AI models became algorithmically feasible for widespread enterprise use. The bottleneck for AI deployment now lies in data supply. Data lakes and big data federation address storage and querying, but traditional ETL, ELT, and stream processing have become the bottleneck on the supply side: either unable to integrate the full range of complex traditional and emerging data sources quickly enough, or unable to serve both AI training and online AI applications from a single codebase.

Fragmentation of the Enterprise Data Community

As organizations become more data-driven, the number of data users has grown rapidly, from traditional data engineers to data analysts, AI practitioners, sales analysts, and financial analysts, each with different data requirements. After detours through NoSQL and NewSQL, SQL has re-emerged as the de facto standard for complex business analysis, and a large share of analysts and business unit engineers now use SQL to solve the "last mile" of data analysis within the enterprise, while the handling of complex unstructured data is left to professional data engineers using technologies like Spark, MapReduce, and Flink. The needs of these two groups diverge significantly, and traditional ETL and ELT architectures cannot satisfy both.

EtLT Architecture Emerges!

Against this backdrop, data processing has gradually evolved into the EtLT architecture:

EtLT Architecture Overview: EtLT splits the traditional ETL and ELT structures and combines real-time and batch processing to accommodate real-time data warehouses and AI applications.

E(xtract) — Extraction: EtLT supports traditional on-premises databases, files, and software as well as emerging cloud databases, SaaS APIs, and serverless data sources. It can perform real-time CDC on database binlogs, consume real-time streams (e.g., from Kafka), and also handle bulk data reads (multi-threaded partitioned reading, rate limiting, and so on).

t(ransform) — Normalization: In addition to the big "T" of ETL and ELT, EtLT introduces a small "t" focused on data normalization.
It rapidly converts complex, heterogeneous extracted data into structured data that can be readily loaded into the target storage. It handles real-time CDC streams by splitting records, filtering, and changing field formats, and it supports both batch and real-time distribution to the Load stage.

L(oad) — Loading: Loading is no longer just about moving data; it also adapts the source's structure and content to the target data destination. It should handle schema evolution in the source and support efficient loading methods such as bulk load, SaaS loading (Reverse ETL), and JDBC loading, ensuring support for real-time data and structural changes alongside fast batch loads.

(T)ransform — Transformation: Business logic is processed in the cloud data warehouse, the on-premises warehouse, or the data federation, typically in SQL, in either real-time or batch mode, turning complex business rules accurately and quickly into data that business users or AI applications can consume.

In the EtLT architecture, different roles have distinct responsibilities:

EtL phase: Primarily handled by data engineers, who turn complex, heterogeneous data sources into structured data that can be loaded into the warehouse or data federation. They do not need deep knowledge of enterprise metric rules, but they must be proficient at converting raw and unstructured source data into structured data; their focus is timeliness and the accuracy of that conversion (a minimal code sketch of this small-"t" step appears at the end of this section).

T phase: Led by data analysts, business-facing SQL developers, and AI engineers who understand the enterprise's business rules. They express those rules as SQL to analyze and aggregate the underlying structured data, enabling analytics and AI applications; their focus is data logic, data quality, and meeting the business requirements of the final results.

Open-Source Implementations of EtLT

There are several open-source implementations of EtLT in the modern data stack. dbt helps analysts and business developers quickly build data applications on top of warehouses such as Snowflake, while Apache DolphinScheduler provides visual workflow orchestration for big data tasks; DolphinScheduler also plans a Task IDE that will let analysts debug SQL tasks for Hudi, Hive, Presto, ClickHouse, and more, and assemble workflows by drag-and-drop. As a representative of the EtLT architecture, Apache SeaTunnel started with support for a wide range of cloud and on-premises data sources and has gradually expanded to SaaS sources, Reverse ETL, and the data-supply needs of large-scale models, continually filling out the EtLT landscape. The latest SeaTunnel Zeta computing engine delegates complex operations such as Join and Group By to the target data warehouse and concentrates on data normalization and standardization, aiming for unified real-time and batch processing from a single codebase on a high-performance engine. SeaTunnel now also supports large-scale models directly, letting them interact with the 100+ supported data sources, from traditional and cloud databases through to SaaS.
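To make the division of labor concrete, here is a minimal, illustrative sketch of the small "t": it normalizes a raw change event (laid out here in a Debezium-like shape, which is an assumption for the example) into a flat, typed record that the Load stage can write, while all business aggregation is deliberately left to SQL in the warehouse. This is not SeaTunnel's API, just the idea behind the stage.

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Map;

public class SmallTNormalizer {

    // The structured row the Load stage expects; column names are illustrative
    record OrderRow(long orderId, long customerId, BigDecimal amount, String orderDate, String op) {}

    // Small "t": purely technical normalization -- flatten the payload, fix types and formats,
    // tag the operation. No joins or metrics happen here; that is the big "T" in the warehouse.
    static OrderRow normalize(Map<String, Object> rawChangeEvent) {
        @SuppressWarnings("unchecked")
        Map<String, Object> after = (Map<String, Object>) rawChangeEvent.get("after"); // row image from the binlog
        long epochMillis = ((Number) rawChangeEvent.get("ts_ms")).longValue();

        return new OrderRow(
                Long.parseLong(after.get("order_id").toString()),
                Long.parseLong(after.get("customer_id").toString()),
                new BigDecimal(after.get("amount").toString()),                    // string -> decimal
                DateTimeFormatter.ISO_LOCAL_DATE.format(
                        Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC)), // epoch millis -> ISO date
                rawChangeEvent.get("op").toString());                              // c = insert, u = update, d = delete
    }

    public static void main(String[] args) {
        Map<String, Object> event = Map.of(
                "op", "c",
                "ts_ms", 1690000000000L,
                "after", Map.of("order_id", "1001", "customer_id", "42", "amount", "19.90"));
        System.out.println(normalize(event)); // ready for bulk load, JDBC load, or a SaaS sink
    }
}
```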
Since joining the Apache Incubator in late 2022, Apache SeaTunnel has seen five-fold growth in a single year and now supports more than 100 data sources, with connector coverage steadily improving across traditional databases, cloud databases, and SaaS offerings. The SeaTunnel Zeta Engine released in Apache SeaTunnel 2.3.0 brings features such as distributed CDC, schema evolution between source and target tables, and whole-database, multi-table synchronization. Its performance has attracted numerous global users, including Bharti Airtel, the second-largest telecommunications operator in India; Shopee.com, the Singapore-based e-commerce platform; and Vip.com, a major online retailer.

Large-Scale Model Training Support

One noteworthy development is that SeaTunnel now supports large-scale model training and vector databases, enabling seamless interaction between large models and SeaTunnel's 100+ supported data sources. SeaTunnel can also leverage ChatGPT to generate SaaS connectors, giving large models and data warehouses rapid access to a wide range of information on the internet.

As AI, cloud, and SaaS continue to grow in complexity, the demands of real-time CDC, SaaS integration, and data lake and real-time warehouse loading have made simple ETL architectures inadequate for modern enterprises. The EtLT architecture, adaptable to different stages of enterprise development, is destined to shine in the modern data infrastructure. With the mission of "Connecting the world's data sources and synchronizing them as swiftly as flying," Apache SeaTunnel deserves the attention of every data professional.
Nowadays we usually build multiple services to make a single product work, and client apps need to consume functionality from more than one service. Microservices architecture has become a popular approach for building scalable and resilient applications: in a microservices-based system, multiple loosely coupled services work together to deliver the desired functionality. One of the key challenges in such systems is exchanging data between microservices reliably and efficiently, and one pattern that helps address this challenge is the Outbox pattern. In this article, we will explore how to implement the Outbox pattern with a streaming database, which provides a reliable solution for data exchange between microservices.

The Need for Reliable Microservices Data Exchange

In a microservices architecture, each microservice owns its business logic and its data, has its own local data store, and performs its own operations on that data (the database-per-service pattern). There are scenarios, however, where microservices need to share data or notify other services about specific data changes in real time in order to stay consistent and give end users a cohesive experience. Consider a ride-hailing service, where separate microservices handle user management, ride booking, driver management, and payment processing; when a user requests a ride, it triggers a series of events that must be propagated to the various microservices so they can process them and update their own data. (Ride-hailing services, such as Uber, Lyft, and Bolt, are platforms through which customers order rides.)

Traditional synchronous communication between microservices can lead to tight coupling and to performance and reliability issues. A calling service needs to know the location, interface, and contract of the other microservices, which creates a complex web of dependencies and makes it hard to evolve, test, and deploy microservices independently, since a change in one may require changes in many. Target services may also be temporarily unavailable, and synchronous calls add overhead because the caller must wait, blocked, until a response arrives.

Asynchronous and Decoupled Data Exchange

The Outbox pattern, by contrast, promotes asynchronous and decoupled data exchange. When an event or change occurs in one microservice, the service writes it to its outbox, which acts as a buffer and can be implemented as a separate database table owned by that service. The outbox is then read by an Outbox Processor — a separate component or service, such as the streaming database RisingWave explained in the next section — which picks up the events and forwards them asynchronously to the other microservices or data stores. The originating microservice can keep serving requests without waiting for the data exchange to complete, which improves performance and scalability. Microservice B, which needs to be updated with changes from Microservice A, receives the events from the outbox processor and applies them to its own state or data store, keeping its data consistent with the changes made in Microservice A.
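Here is a minimal sketch of the producing side of the pattern, assuming a relational store accessed over JDBC: the business row and the outbox row are written in the same local transaction, so a "RideRequested" event can never be published for a ride that was not actually stored, and vice versa. The table and column names, connection details, and JSON payload are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

public class RideService {

    // Writes the ride and its outbox event atomically; a CDC-based outbox processor
    // (e.g., a streaming database tailing the binlog) picks the event up afterwards.
    public static void requestRide(String riderId, String pickup, String dropoff) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/rides", "ride_service", "secret")) {
            conn.setAutoCommit(false);
            try {
                String rideId = UUID.randomUUID().toString();

                // 1. Business state owned by this microservice
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO rides (id, rider_id, pickup, dropoff, status) VALUES (?, ?, ?, ?, 'REQUESTED')")) {
                    ps.setString(1, rideId);
                    ps.setString(2, riderId);
                    ps.setString(3, pickup);
                    ps.setString(4, dropoff);
                    ps.executeUpdate();
                }

                // 2. The event, written to the outbox table in the same transaction and database
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO outbox (id, aggregate_id, event_type, payload) VALUES (?, ?, 'RideRequested', ?)")) {
                    ps.setString(1, UUID.randomUUID().toString());
                    ps.setString(2, rideId);
                    ps.setString(3, "{\"rideId\":\"" + rideId + "\",\"riderId\":\"" + riderId + "\"}");
                    ps.executeUpdate();
                }

                conn.commit(); // either both rows exist or neither does
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }
}
```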
Now let's see how the Outbox pattern can be implemented with a streaming database, using RisingWave as the example. There are other streaming databases on the market; this post explains what a streaming database is, when and why to use one, and the key factors to consider when choosing one for your business. RisingWave is a streaming database for building real-time, event-driven services. It can read database change events directly from traditional databases' binlogs or from Kafka topics and build materialized views that join multiple event streams together; it keeps those views up to date as new events arrive and lets you query the latest state of the data with SQL.

Outbox Pattern With a Streaming Database

The streaming database acts as the real-time outbox processor. Using its built-in Change Data Capture (CDC) connectors, it listens for writes and updates on the specified outbox table, captures the changes, and propagates them to other microservices in real or near-real time via Kafka topics (see how to sink data from RisingWave to a Kafka broker). In this article, Gunnar Morling explains another implementation of the Outbox pattern based on CDC and the Debezium connector. One key advantage of using a streaming database is that you do not need both Debezium and Kafka Connect to achieve the same result. It also has its own storage, so streaming data is never lost, and you can create materialized views optimized for the queries your microservices need, as explained in my other blog post. On top of that, the same data can be delivered to BI and analytics platforms so you can make better business decisions based on how your application is used. For example, trip history is an important feature of a ride-hailing service, giving passengers and drivers access to their past trips; instead of querying individual trip records and computing statistics such as total trips, earnings, and ratings on the fly, a materialized view in the streaming database can hold the precomputed trip history for each user.

With the Outbox pattern and a streaming database, the data exchange for our ride-hailing example could work like this:

Ride Service (Microservice A): When a user requests a ride, the Ride Service creates the ride details and writes an event such as "RideRequested" to the outbox table in its own database — say, MySQL. The streaming database captures the "RideRequested" event from the Ride Service's outbox table using its MySQL CDC connector, processes it, and sends it on to the microservices that need to react — the Driver Service (Microservice B), the Payment Service (Microservice C), and the Notification Service (Microservice D) — through a message broker such as Apache Kafka.

Driver Service (Microservice B): The Driver Service receives the "RideRequested" event from the outbox processor and finds an available driver for the ride based on the ride details. It then writes an event such as "DriverAssigned" to its own outbox.

Payment Service (Microservice C): The Payment Service receives the "RideRequested" event from another Kafka topic and calculates the fare for the ride based on the ride details. It then writes an event such as "FareCalculated" to its own outbox.
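On the streaming-database side, the outbox processor itself can be declared in SQL. The sketch below runs that SQL against RisingWave over JDBC (RisingWave is PostgreSQL-wire-compatible, so the standard Postgres driver works): it creates a CDC-backed table over the Ride Service's outbox, a materialized view of the "RideRequested" events, and a Kafka sink that fans them out to downstream services. Connector options differ between RisingWave versions, so treat the WITH clauses as an approximation to check against the current documentation; host names, ports, credentials, and the topic name are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class OutboxProcessorSetup {
    public static void main(String[] args) throws Exception {
        // 4566 is RisingWave's default SQL port; any PostgreSQL driver or client can connect
        try (Connection rw = DriverManager.getConnection("jdbc:postgresql://risingwave:4566/dev", "root", "");
             Statement st = rw.createStatement()) {

            // 1. Ingest the Ride Service's outbox table through the MySQL CDC connector
            st.execute("""
                    CREATE TABLE outbox_events (
                        id VARCHAR,
                        aggregate_id VARCHAR,
                        event_type VARCHAR,
                        payload VARCHAR,
                        PRIMARY KEY (id)
                    ) WITH (
                        connector = 'mysql-cdc',
                        hostname = 'mysql', port = '3306',
                        username = 'ride_service', password = 'secret',
                        database.name = 'rides', table.name = 'outbox'
                    )""");

            // 2. A materialized view that stays up to date as new outbox rows arrive
            st.execute("""
                    CREATE MATERIALIZED VIEW ride_requested AS
                    SELECT aggregate_id AS ride_id, payload
                    FROM outbox_events
                    WHERE event_type = 'RideRequested'""");

            // 3. Fan the events out to the Driver, Payment, and Notification services via Kafka
            //    (if the upstream view is not append-only, an upsert format may be required instead)
            st.execute("""
                    CREATE SINK ride_requested_sink FROM ride_requested
                    WITH (
                        connector = 'kafka',
                        properties.bootstrap.server = 'kafka:9092',
                        topic = 'ride-requested'
                    ) FORMAT PLAIN ENCODE JSON""");
        }
    }
}
```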
We can create multiple Kafka topics so that consumer services can subscribe only to the event types they care about.

Conclusion

As we have seen, synchronous communication between microservices can introduce tight coupling, performance overhead, poor fault tolerance, limited scalability, versioning challenges, and reduced flexibility and agility. By applying the Outbox pattern to decouple data exchange from the main transactional flow and leveraging a real-time streaming platform such as RisingWave, you can achieve high performance, reliability, and consistency in your microservices architecture.