It’s hard to believe that data warehousing (DW) has been around since 1970 when Bill Inmon first defined the term. The 1990s saw Bill Inmon and Ralph Kimball dueling on two different data warehousing approaches, with Kimball publishing The Data Warehousing Toolkit. The 2000s saw MPP databases and the birth of big data and Hadoop. Today, we have cloud data storage offerings and a new data processing approach, the unified programming model (UPM). Let’s briefly explore how we got here and what lays ahead.
ETL (Extract Transform Load)
Many productivity tools rose to prominence to support growing need for sourcing, cleaning, transforming, loading, storing, and accessing data in the data warehouse. Initially, data came into DWs in nightly batches from departmental applications in a structured tabular format. The ETL approach and tooling were introduced with their visual data transformation capabilities so as not to have to code data transformations by hand in the language de jure or plain SQL. Talend was one of the data integration companies that chose an open source approach to ETL. Multiple relational database management systems (RDBMS) rose to prominence to host enterprise data in DW allowing to store data on a large multi-CPU multi-disk machines. Data warehouses capacity was measured in gigabytes.
ELT (Extract Load Transform)
Eventually, as the number of enterprise applications and their usage increased, the amount of data stored in a DW reached terabytes and outgrew single large machine capacity. A new breed of MPP database management appliances was born where data was stored in a shared nothing architecture across multiple physical machines thus sharding and parallelizing data storage and access across multiple notes. With this came a new data processing paradigm — ELT — that allowed data management professionals to offload data transformation processing to the new parallel processing capabilities that data warehousing appliances offered without having to scale their ETL infrastructure to the same scale, especially in high-velocity use cases. Unfortunately, this often meant going back to coding data transformations in SQL. Most data still came in batches from internal enterprise applications.
With the proliferation of the internet and ubiquitous connectivity, new data sources were coming online — web traffic, social media, and location services. This data was usually unstructured or semi-structured with terabyte data volumes. Big data was born with multiple machine Hadoop clusters and a MapReduce programming pattern to extract relevant insights and aggregations to be loaded into DW. ETL vendors such as Talend implemented Hadoop integrations to allow for a higher level of ETL abstraction instead of having to write low-level MapReduce code. Unfortunately, Hadoop and MapReduce were built for batch processing and as new streaming data sources came online, and business desired up to the minute analytics on near real-time data. With that need, the Lambda architecture was born with its two separate processing paths: one for batch and one for streaming. This resulted in data integrators having to code and maintain two data processing paths.
UPM (Unified Programming Model)
Today’s information supply side is growing exponentially — people/device connectivity is increasing, always on and real-time interaction applications are proliferating, and devices are becoming smart-everything. All of this is generating petabytes of new data that is streaming in at a high velocity. At the same time, traditional enterprise data sources have not stopped producing batch data that needs to be integrated with new datasets. On the information consumption side, cloud platform providers such as Google BigQuery offers petabyte scale storage capabilities at the fraction of the on-premise storage cost to store and analyze billion row data sets. Thus, the needs to integrate data from all sources, transform, cleanse, and load in a standard format keeps on growing and all it has to be applied to both streaming and batch data flows consistently. So how do we solve this data integration need? Do we do ETL, ELT, MR, Lambda, or something new? Recently, we’ve seen a different idea start to gain traction. Meet Apache Beam, an open-source framework to unify streaming and batch data processing through the Unified Programming Model (UPM).
Started in 2016 by Google, Talend, and others, the goal of Apache Beam is to “implement batch and streaming data processing jobs that run on any execution engine.” Apache Beam code can run on any Beam executor engine. With its ability to support both batch and streaming with the same programming model, there is no more need for a Lambda architecture with two parallel code execution paths for batch and streaming. Developers now can write a single Apache Beam transformation code once that works for both traditional bounded batch data sets and continuous unbounded infinite streaming data sources such as IoT device data.
Google Cloud DataFlow is an Apache Beam runner on Google Cloud Platform. Google Cloud Dataflow offers no-ops serverless auto-scalable processing. It will scale execution nodes up and down based on data volume. Google Cloud Dataflow is well integrated with Google BigQuery for streaming inserts (Google’s data warehouse in the cloud offering).
This sounds all very exciting, but there must be a catch? The biggest difficulty with Apache Beam and UPM is the fairly low-level programming that needs to be done in implementing transformations. An integration developer has to have a deep understanding of the Apache Beam programming model to efficiently implement transformations especially for unbounded data sources that require mastering such concepts as windowing, triggering, and watermarks. This requires data engineering skills instead of data integration skills — think Hadoop MapReduce programming in the early days.