ChatGPT, ZeroETL, and Other Data Engineering Disruptors
Zero-ETL, AI, One Big Table, and other disruptors could radically change data engineering to create a post-modern data stack. Are we ready?
If you don’t like change, data engineering is not for you. Little in this space has escaped reinvention.
The most prominent, recent examples are Snowflake and Databricks, disrupting the concept of the database and ushering in the modern data stack era.
As part of this movement, Fivetran and dbt fundamentally altered the data pipeline from ETL to ELT. Hightouch interrupted SaaS eating the world in an attempt to shift the center of gravity to the data warehouse. Monte Carlo joined the fray and said, “Maybe having engineers manually code unit tests isn’t the best way to ensure data quality.”
Today, data engineers continue to stomp on hard-coded pipelines and on-premises servers as they march up the modern data stack slope of enlightenment. The inevitable consolidation and trough of disillusionment appear at a safe distance on the horizon.
And so it almost seems unfair that new ideas are already springing up to disrupt the disruptors:
- Zero-ETL has data ingestion in its sights
- AI and Large Language Models could transform transformation
- Data product containers are eyeing the table’s throne as the core building block of data
Are we going to have to rebuild everything (again)? Hell, the body of the Hadoop era isn’t even all that cold.
The answer is yes, of course, we will have to rebuild our data systems. Probably several times throughout our careers. The real questions are the why, the when, and the how (in that order).
I don’t profess to have all the answers or a crystal ball. But this article will closely examine some of the most prominent near(ish) future ideas that may become part of the post-modern data stack as well as their potential impact on data engineering.
Practicalities and Tradeoffs
The modern data stack reigns supreme because it supports use cases and unlocks value from data in ways that were previously, if not impossible, then certainly very difficult. Machine learning moved from buzzword to revenue generator. Analytics and experimentation can go deeper to support bigger decisions.
The same will be true for each of the trends below. There will be pros and cons, but what will drive adoption is how they, or the dark horse idea we haven’t yet discovered, unlock new ways to leverage data. Let’s look closer at each.
Zero-ETL
What it is: A misnomer, for one thing; the data pipeline still exists.
Today, data is often generated by a service and written into a transactional database. An automatic pipeline is deployed, which not only moves the raw data to the analytical data warehouse but modifies it slightly along the way.
For example, APIs will export data in JSON format, and the ingestion pipeline will need to not only transport the data but apply light transformation to ensure it is in a table format that can be loaded into the data warehouse. Other common light transformations done within the ingestion phase are data formatting and deduplication.
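The light transformations described above can be sketched in a few lines of Python. This is a minimal illustration rather than any vendor's pipeline: the `flatten` and `ingest` helpers, the `id` deduplication key, and the sample payload are all hypothetical, and a real ingestion tool would also handle arrays, type coercion, and late-arriving records.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


def ingest(raw_json_lines, key="id"):
    """Parse JSON lines, flatten each record, and keep the last row per key."""
    rows_by_key = {}
    for line in raw_json_lines:
        row = flatten(json.loads(line))
        rows_by_key[row[key]] = row  # a later duplicate overwrites the earlier row
    return list(rows_by_key.values())


# Hypothetical API export: record 1 appears twice with different plans.
payload = [
    '{"id": 1, "user": {"name": "Ada", "plan": "pro"}}',
    '{"id": 2, "user": {"name": "Lin", "plan": "free"}}',
    '{"id": 1, "user": {"name": "Ada", "plan": "team"}}',
]
rows = ingest(payload)
```

Each resulting row is a flat mapping of column name to value, which is roughly the shape a warehouse loader expects. Keeping the last record per key is one assumed deduplication policy; others (first seen, latest timestamp) are just as plausible.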
While you can do heavier transformations by hard-coding pipelines in Python, and some have advocated for doing just that to deliver data pre-modeled to the warehouse, most data teams choose not to do so for expediency and visibility/quality reasons.
Zero-ETL changes this ingestion process by having the transactional database do the data cleaning and normalization prior to automatically loading it into the data warehouse. It’s important to note the data is still in a relatively raw state.
At the moment, this tight integration is possible because most zero-ETL architectures require both the transactional database and data warehouse to be from the same cloud provider.
Pros: Reduced latency. No duplicate data storage. One fewer source of failure.
Cons: Less ability to customize how the data is treated during the ingestion phase. Some vendor lock-in.
Who’s driving it: AWS is the driver behind the buzzword (Aurora to Redshift), but GCP (Bigtable to BigQuery) and Snowflake (Unistore) offer similar capabilities. Snowflake (Secure Data Sharing) and Databricks (Delta Sharing) are also pursuing what they call “no copy data sharing.” This process actually doesn’t involve ETL and instead provides expanded access to the data where it’s stored.
Practicality and value unlock potential: On one hand, with the tech giants behind it and ready-to-go capabilities, zero-ETL seems like it’s only a matter of time. On the other, I’ve observed data teams decoupling rather than more tightly integrating their operational and analytical databases to prevent unexpected schema changes from crashing the entire operation.
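To make the schema-change concern concrete, here is a rough sketch of the kind of check a decoupled pipeline can run before loading a row, and which a tightly integrated zero-ETL path leaves less room for. The `EXPECTED_SCHEMA` contract, the column names, and the `schema_violations` helper are assumptions made up for illustration, not part of any zero-ETL product.

```python
# Hypothetical contract for an orders table: column name -> expected Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "currency": str}


def schema_violations(row, expected=EXPECTED_SCHEMA):
    """Return a list of human-readable problems; an empty list means the row conforms."""
    problems = []
    for column, expected_type in expected.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            actual = type(row[column]).__name__
            problems.append(f"{column}: expected {expected_type.__name__}, got {actual}")
    # Columns the upstream service added without warning.
    for column in sorted(row.keys() - expected.keys()):
        problems.append(f"unexpected column: {column}")
    return problems


good_row = {"order_id": 1, "amount": 9.5, "currency": "USD"}
drifted_row = {"order_id": "1", "amount": 9.5, "total": 3}

good_report = schema_violations(good_row)
drift_report = schema_violations(drifted_row)
```

A pipeline could quarantine rows with a non-empty report instead of letting them break downstream models. Where such a check would live when the database loads the warehouse directly is exactly the open question.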
This innovation could further lower the visibility and accountability of software engineers toward the data their services produce. Why should they care about the schema when the data is already on its way to the warehouse shortly after the code is committed?
With data streaming and micro-batch approaches seeming to serve most demands for “real-time” data at the moment, I see the primary business driver for this type of innovation as infrastructure simplification. And while that’s nothing to scoff at, the potential for no copy data sharing to remove obstacles such as lengthy security reviews may result in greater adoption in the long run (although, to be clear, it’s not an either/or).
One Big Table and Large Language Models
What it is: Currently, business stakeholders need to express their requirements, metrics, and logic to data professionals, who then translate it all into a SQL query and maybe even a dashboard. That process takes time, even when all the data already exists within the data warehouse. Not to mention that, on the data team’s list of favorite activities, ad-hoc data requests rank somewhere between a root canal and documentation.
There is a bevy of startups aiming to use the power of large language models like GPT-4 to automate that process by letting consumers “query” the data in their natural language in a slick interface. At least until our newly sentient robot overlords make binary the official galactic language.
Published at DZone with permission of Shane Murray. See the original article here.
Opinions expressed by DZone contributors are their own.