Case Studies: Cloud-Native Data Streaming for Data Warehouse Modernization

Let's explore a few case studies for cloud-native data streaming and data warehouse modernization.

By Kai Wähner · Oct. 15, 2022 · Analysis

Every project is different. That holds for data streaming, analytics, and any other software development. The following five case studies use significantly different architectures and technologies for data warehouse modernization. The examples come from various verticals, including the software and cloud business, financial services, logistics and transportation, and the travel and accommodation industry.

Confluent: Data Warehouse Modernization From Batch ETL With Stitch to Streaming ETL With Kafka

The article "Streaming ETL SFDC Data for Real-Time Customer Analytics" explores how Confluent eats its own dog food to modernize its internal data warehouse pipeline.

The use case is straightforward and standard across most organizations: Extract, transform, and load (ETL) Salesforce data into a Google BigQuery data warehouse, so that the business can use the data. But it is more complex than it sounds.

Organizations often rely on a third-party ETL tool to periodically load data from a CRM and other applications to their data warehouse. These batch tools introduce a lag between when the business events are captured in Salesforce and when they are made available for consumption and processing. The batch workloads commonly result in discrepancies between Salesforce reports and internal dashboards, leading to concerns about the integrity and reliability of the data.

In the beginning, Confluent used Stitch, Talend's batch ETL tool. Batch ETL with a third-party tool in the middle led to insufficient and inconsistent information updates.

Over the past few years, Confluent has invested in building stream processing capabilities into the internal data warehouse pipeline. In the modernized architecture, Confluent leverages its own fully managed Confluent Cloud connectors (in this case, the Salesforce CDC source and BigQuery sink connectors), Schema Registry for data governance, and ksqlDB + Kafka Streams for reliable streaming ETL that sends SFDC data to BigQuery.
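The article shares the architecture rather than code, but a minimal Kafka Streams sketch of such a streaming ETL step could look like the following. The topic names and the transformation are assumptions for illustration; in Confluent's actual pipeline, the fully managed connectors handle the Salesforce CDC source and the BigQuery sink, and ksqlDB expresses much of the transformation logic.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class SfdcToBigQueryEtl {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sfdc-streaming-etl");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Raw Salesforce change events written by the CDC source connector (topic name assumed).
        KStream<String, String> sfdcEvents = builder.stream("sfdc-cdc-events");

        sfdcEvents
            .filter((key, value) -> value != null)           // drop tombstones
            .mapValues(SfdcToBigQueryEtl::toWarehouseShape)  // reshape for the warehouse
            .to("bigquery-staging");                         // read by the BigQuery sink connector

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder: a real pipeline would map the CDC payload (e.g., Schema
    // Registry-backed Avro) to the BigQuery table schema here.
    private static String toWarehouseShape(String cdcEvent) {
        return cdcEvent;
    }
}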

PayPal: Reducing the Time for Readouts From 12 Hours to a Few Seconds for 30 Billion Events Per Day

PayPal has plenty of Kafka projects for critical and analytical workloads. In this use case, it scales Kafka consumers to 30-35 billion events per day to migrate its analytical workloads to the Google Cloud Platform (GCP).

A streaming application ingests the events from Kafka directly into BigQuery. This is a critical project for PayPal, as most analytical readouts are based on it. The outcome of the data warehouse modernization and the cloud-native architecture: the time for readouts dropped from 12 hours to a few seconds.
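PayPal's blog post covers the tuning details; below is only a minimal sketch of the underlying pattern: a Kafka consumer that polls events and forwards micro-batches to BigQuery. The topic name and the writeToBigQuery helper are hypothetical; at 30+ billion events per day, batching, parallelism, and error handling are where the real engineering effort goes.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class AnalyticsIngestor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "bigquery-ingestor");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit only after a successful write

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("analytics-events")); // topic name assumed
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                List<String> batch = new ArrayList<>();
                for (ConsumerRecord<String, String> record : records) {
                    batch.add(record.value());
                }
                if (!batch.isEmpty()) {
                    writeToBigQuery(batch); // hypothetical helper wrapping the BigQuery write API
                    consumer.commitSync();  // at-least-once: commit offsets after the batch lands
                }
            }
        }
    }

    private static void writeToBigQuery(List<String> rows) {
        // Placeholder: a real implementation would use the BigQuery Storage Write API.
    }
}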

Read more about this success story in the PayPal technology blog.

Shippeo: From On-Premise Databases to Multiple Cloud-Native Data Lakes

Shippeo provides real-time and multimodal transportation visibility for logistics providers, shippers, and carriers. Its software uses automation and artificial intelligence to share real-time insights, enable better collaboration, and unlock the full potential of its customers' supply chains. The platform gives instant access to predictive, real-time information for every delivery.

Shippeo described how it integrated traditional databases (MySQL and PostgreSQL) and cloud-native data warehouses (Snowflake and BigQuery) with Apache Kafka and Debezium.

This is an excellent example of cloud-native enterprise architecture leveraging a "best-of-breed" approach for data warehousing and analytics. Kafka decouples the analytical workloads from the transactional systems and handles the backpressure for slow consumers.
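Shippeo's exact settings are not published, but a Debezium MySQL source connector registered with Kafka Connect is typically configured like the sketch below; the host, credentials, database, and table names are made up for illustration. Each captured table becomes a Kafka topic that the Snowflake and BigQuery sink connectors consume independently, at their own pace.

{
  "name": "mysql-cdc-source",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.internal.example.com",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "changeme",
    "database.server.id": "5400",
    "topic.prefix": "shipments-db",
    "database.include.list": "shipments",
    "table.include.list": "shipments.deliveries",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.shipments"
  }
}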

Sykes Cottages: Fully Managed End-to-End Pipeline With Confluent Cloud, Kafka Connect, Snowflake

Sykes Holiday Cottages is one of the UK's leading and fastest-growing independent holiday cottage rental agencies, representing over 19,000 cottages across the UK, Ireland, and New Zealand.

Its customers' experience on the web is a top priority and one way to stay competitive. The goal is to match customers to their perfect holiday cottage experience and to delight them at each stage along the way. Getting the data pipeline right to fuel this innovation is critical. Data warehouse modernization and data streaming enabled new ways to innovate the web experience through a data-driven approach.

From Inconsistent and Slow Batch Workloads...

While it served its purpose for several years, the existing pipeline had problems that impaired this cycle. Very early in the pipeline, the ETL process turned the data into rows and columns (structured data). Various copies were made, and the results were presented via a static report. Data engineers were needed for changes such as new events or contextual information. Scaling was also challenging, as most of this had to be done manually.

Critically, by keeping the data in a semi-structured format until it is ingested into the warehouse and then using ELT to do a single transformation of the data, Sykes Holiday Cottages can simplify the pipeline and make it much more agile.
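The warehouse logic itself is not published, but the ELT pattern described here maps naturally to Snowflake SQL: land the raw, semi-structured events in a VARIANT column and impose structure with a single transformation inside the warehouse. Table and field names below are assumptions for illustration.

-- Landing table: events arrive as raw JSON, with no upfront schema.
CREATE TABLE IF NOT EXISTS raw_web_events (
    ingested_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    payload     VARIANT
);

-- Single ELT transformation: structure is imposed only at model time,
-- so new event attributes flow through without pipeline code changes.
CREATE OR REPLACE VIEW web_events AS
SELECT
    payload:event_type::STRING AS event_type,
    payload:session_id::STRING AS session_id,
    payload:properties         AS properties,  -- nested context stays queryable as-is
    TO_TIMESTAMP_NTZ(payload:occurred_at::STRING) AS occurred_at
FROM raw_web_events;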

...To Event-Based Real-Time Updates and Continuous Stream Processing

New web events (and any context that goes with them) are wrapped in a message and flow all the way to the warehouse without a single code change. The new events are then available to the web teams either through a query or the visualization tool.

The current throughput is around 50,000 messages per minute (peaking at over 300,000). As new events are captured, this will grow considerably, and each of the components above must scale accordingly.

The new architecture enables the web teams to capture new events and analyze the data using self-service tools with no dependency on data engineering.


In conclusion, the business case is compelling. Based on its testing and projections, Sykes expects at least 10x ROI over three years for this investment.

Learn more details in the Sykes Holiday Cottages blog post: Why Sykes Cottages partnered with Snowflake and Confluent to drive enhanced customer experience.

DoorDash: From Multiple Pipelines to Data Streaming for Snowflake Integration

Even digital natives, which started their business in the cloud without legacy applications in their own data centers, need to modernize their enterprise architecture to improve business processes, reduce costs, and provide real-time information to downstream applications.

It is cost-inefficient to build multiple pipelines that try to achieve similar purposes. DoorDash originally used cloud-native AWS messaging and streaming systems like Amazon SQS and Amazon Kinesis for data ingestion into the Snowflake data warehouse.


Mixing different kinds of data transport and going through multiple messaging/queueing systems without carefully designed observability around them leads to difficulties in operations.

These issues resulted in high data latency, significant cost, and operational overhead at DoorDash. Therefore, DoorDash moved to a cloud-native streaming platform powered by Apache Kafka and Apache Flink, with continuous stream processing before ingesting data into Snowflake.

The move to a data streaming platform provides many benefits to DoorDash:

  • Heterogeneous data sources and destinations, including REST APIs via the Confluent REST Proxy
  • Easily accessible 
  • End-to-end data governance with schema enforcement and schema evolution with Confluent Schema Registry
  • Scalable, fault-tolerant, and easy to operate for a small team
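None of DoorDash's code appears in the blog post, but the Kafka-to-Flink half of such a pipeline, written against Flink's DataStream API, typically looks like the sketch below. The topics, group ID, and enrichment step are assumptions; the Snowflake ingestion on the other end is handled separately (e.g., by a sink connector reading the processed topic).

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventProcessingJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Consume raw events from Kafka (topic and group are assumptions).
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("app-events")
                .setGroupId("event-processing")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> events =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-events");

        // Processed events land on one topic bound for Snowflake, instead of
        // separate per-destination pipelines.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("snowflake-ingest")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        // Continuous stream processing before the warehouse: filter and enrich in flight.
        events.filter(value -> !value.isBlank())
              .map(EventProcessingJob::enrich)
              .sinkTo(sink);

        env.execute("event-processing");
    }

    private static String enrich(String event) {
        return event; // placeholder enrichment
    }
}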

All the details about this cloud-native infrastructure optimization are in DoorDash's engineering blog post: "Building Scalable Real-Time Event Processing with Kafka and Flink."

Real-World Case Studies for Cloud-Native Projects Prove the Business Value

Data warehouse and data lake modernization only makes sense if there is business value. Elastic scale, reduced operational complexity, and faster time to market are significant advantages of cloud services like Snowflake, Databricks, or Google BigQuery.

Data streaming plays a vital role in these initiatives: integrating legacy and cloud-native data sources, running continuous streaming ETL, and truly decoupling the data sources from the many data sinks (lakes, warehouses, business applications).

The case studies from Confluent, PayPal, Shippeo, Sykes Cottages, and DoorDash show different success stories of moving to cloud-native infrastructure to gain real-time visibility and analytics capabilities. Elastic scale and fully managed end-to-end pipelines are crucial success factors in deriving business value from consistently up-to-date information.


Published at DZone with permission of Kai Wähner, DZone MVB. See the original article here.

