When Not to Choose Google Apache Kafka for BigQuery

When not to choose Google Apache Kafka for BigQuery — a review, comparison with Confluent Cloud, and data streaming landscape outlook.

Kai Wähner

CORE ·

Jul. 16, 24 · News

Likes (1)

Comment

Save

4.7K Views

Google announced its Apache Kafka for BigQuery cloud service at its conference Google Cloud Next 2024 in Las Vegas. Welcome to the data streaming club joining Amazon, Microsoft, IBM, Oracle, Confluent, and others. This blog post explores this new managed Kafka offering for GCP, reviews the current status of the data streaming landscape, and shares some criteria to evaluate when Kafka in general and Google Apache Kafka in particular should (not) be used.

Welcome Google Apache Kafka to the Data Streaming Club

Better late than never… Google announced a brand new Apache Kafka cloud service for GCP at Google Cloud Next 2024. All other leading cloud providers already have one, including AWS, Azure, Oracle, IBM, and Alibaba. Various other software vendors provide Kafka services, including Confluent, Aiven, Redpanda, WarpStream, and many more. Most leverage the open-source Kafka project as its core component, while others re-implement the Kafka protocol.

Apache Kafka and Apache Flink dominate the open-source data streaming ecosystem. Vendors and cloud solutions provide cloud-native offerings. Some developers, data engineers, and business people still struggle with a paradigm shift: Continuous data processing enables better data quality, reduced cost, and faster time to market with innovative new applications. Kafka and Flink are a match made in heaven for data streaming.

Use Cases for data streaming exist across all industries. Google Apache Kafka for BigQuery is potentially a good fit for some of them, but not for others.

Google Apache Kafka for BigQuery — What Is It?

What is Google Apache Kafka for BigQuery? Quoting Google's website: "Apache Kafka for BigQuery is a managed service that operates highly available Apache Kafka clusters. It is compatible with open source versions of Apache Kafka and includes first-party Google Cloud IAM, monitoring, logging, key management, organization policy, networking, and more." Here are a few more thoughts:

Asynchronous messaging with true decoupling and producers and consumers using the publish/subscribe pattern is possible with GCP proprietary service Google Pub/Sub. Why did Google now introduce a Kafka service? Limitations of Google Pub/Sub or because Kafka became the standard (e.g., to migrate on-premise Kafka workloads from customers)? I guess a bit of both.
Google re-uses open-source Kafka instead of re-implementing the Kafka protocol (like Microsoft Azure's Event Hubs). I like this approach as a new implementation always creates several new challenges like missing completeness, delays of new features, and unexpected behavior. The compatibility with open-source Kafka is mentioned several times. My personal assumption is that Google's main strategic goal for the new Kafka service is to migrate existing on-premise workloads into Google Cloud.
I really like that the service is secure out of the box. It is integrated with and supports Google Cloud IAM, customer-managed encryption keys (CMEK), and Virtual Private Cloud (VPC) from the beginning. This is important as most workloads at enterprises require this.
Including the term 'BigQuery' is only a marketing strategy: "Data engineers often rely on Apache Kafka to build pipelines that stream data into BigQuery and other analytics systems. Apache Kafka for BigQuery can be used for real-time and batch use cases". There is no requirement to use BigQuery for analytics. Google's Kafka service is usable with other analytics platforms, too.
Google emphasizes analytics use cases everywhere around its Kafka service; NOT transactional workloads. This approach is similar to Amazon MSK. Hopefully, the Google terms and conditions don't exclude Kafka support when the service is GA (that's what MSK does — unfortunately, too many people don't read T&C and just use a cloud service in production).

Data Streaming Is a NEW Software Category

Data streaming represents a new software category that revolutionizes the way businesses harness and process data in real-time. Unlike traditional batch processing methods, data streaming enables continuous ingestion, analysis, and processing of data as it flows through systems.

The Data Streaming Landscape 2024

Many software companies have emerged in the data streaming category in the last few years. And several mature players in the data market added support for data streaming in their platforms or cloud service ecosystem. Most software vendors use Kafka for their data streaming platforms. However, there is more than solutions powered by open-source Kafka. Some vendors only use the Kafka protocol (e.g., Azure Event Hubs) or utterly different APIs (like Amazon Kinesis).

The following Data Streaming Landscape 2024 summarizes the current status of relevant products and cloud services for data streaming around Kafka and additional stream processing engines.

Forrester Wave for Streaming Data and IDG MarketScape for Stream Processing

Apache Kafka became the de facto standard for data streaming, similar to how Amazon S3 became the de facto standard for object storage.

In December 2023, the research company Forrester published “The Forrester Wave™: Streaming Data Platforms, Q4 2023." Get free access to the report here. The leaders are Microsoft, Google, and Confluent, followed by Oracle, Amazon, Cloudera, and a few others.

In April 2024, IDC named Confluent a leader in the IDC MarketScape for Worldwide Analytic Stream Processing 2024.

It would not be a surprise if we see a Gartner Magic Quadrant for Data Streaming soon, too. Gartner reports mention Kafka and related vendors more and more year by year.

When Not To Choose Google Apache Kafka for BigQuery

Qualifying out a technology is often the easier option. Why evaluate a service if it does not meet the requirements? Let's explore when NOT to use Kafka at all, and specifically when the Google Apache Kafka service is probably NOT the right choice for you.

When Not To Use Apache Kafka

Apache Kafka has overlaps with technologies like a message broker (like IBM MQ, TIBCO, or RabbitMQ), and other streaming analytics platforms, and it actually is a database, too. But Apache Kafka is not an allrounder to solve every problem.

Apache Kafka is NOT:

A replacement for your favorite database, data warehouse, or data lake. Instead, it complements and integrates with these platforms.
An analytics platform for AI/ML model training, even though model scoring is often done within the streaming platform for critical or low-latency use cases.
A proxy for thousands of clients in bad networks.
An API Management solution, even though you can connect REST/HTTP producers and consumers against Kafka.
An IoT gateway, even though direct integration with IoT protocols like MQTT or OPC-UA is possible.
Hard real-time for safety-critical embedded workloads.

Read the thorough analysis "When NOT to use Apache Kafka?" for more details. Or watch this YouTube video:

When To Choose Another Kafka Instead of Google’s

If Apache Kafka is the right choice for your project, you still have plenty of options.

Here are a few criteria that let you easily disqualify Google Apache Kafka for BigQuery:

Non-GCP: If your use case requires on-premise, multi-cloud, hybrid cloud, or edge deployments, then you need another offer.
Critical SLAs: If you need 24/7 critical support and consulting expertise, a dedicated Kafka vendor like Confluent is the better choice. Kafka is not just for analytics, but shines for transactional workloads, too. Google's Managed Apache Kafka service is not GA yet. This will probably happen in the second half of 2024. Hence, don't even consider it for critical applications before GA.
Serverless: A managed service is not always a truly managed service. The future will show where Google goes with Kafka. But right now, Google Apache Kafka is not serverless like e.g., Confluent Cloud. You pay for capacity pricing and cluster capacity management is required. Amazon even created a second service Amazon MSK Serverless to handle this issue with its traditional MSK offering.
Complete data streaming platform: A data streaming platform requires more than just messaging: data integration with first and third-party systems, stream processing for continuous data correlation, flexible (long-term) retention with Tiered Storage, data governance, and more. The future will show us where Google's Kafka service goes. Google is a car, but not (yet) a Porsche (complete luxury car) and not yet a Google Waymo (self-driving car level 5). Google Apache Kafka even misses basic features for data streaming best practices, like defining data contracts in schemas for building data products with good data quality.

The Evolution of Data Streaming Is Not Stopping

If you did not qualify out Kafka in general or Google Apache Kafka in particular yet, that's great. Start evaluating Google's Managed Apache Kafka cloud service and compare it against self-managed open source Kafka and other semi-managed or fully-managed Kafka cloud services on GCP.

As we look ahead, the future possibilities for data streaming are boundless, promising more agile, intelligent, and real-time insights into the ever-increasing streams of data.

I often get the question if I am worried about the emerging competition as I work for Confluent where we “only do data streaming”?

No, I am not! Actually, the new Google Apache Kafka cloud service is great news for the industry! Data Streaming established itself as a new software category. Research analysts like Forrester and IDG already created dedicated waves and comparisons. What could be better than working with the people who invented Kafka and the company that created this software category across all industries and continents? And competition is always good for innovation, too.

Real-time data beats slow data. That’s true in almost every use case. At Confluent, we are now ~3000 people working only on one thing: Data Streaming. I think we should celebrate this Google announcement and look forward to more mass adoption of data streaming around the world.

And as a strategic Google partner, customers can

Leverage GCP credits to consume Confluent Cloud
Leverage GCPs security and private networking infrastructure
Integrate via fully managed connectors into various GCP services like Google Big Query or Google Cloud Storage and third-party cloud solutions like MongoDB, Snowflake, or Databricks.

Are you excited about the new Google Apache Kafka cloud service? Or do you still plan to use open-source Kafka or another cloud service like Confluent Cloud? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Data processing Stream processing Data (computing) kafka

Published at DZone with permission of Kai Wähner, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending