Incremental Jobs and Data Quality Are On a Collision Course
Big data isn’t dead; it’s just going incremental. But bad things happen when uncontrolled changes collide with incremental jobs. Reacting to changes is a losing strategy.
If you keep an eye on the data ecosystem like I do, then you'll be aware of the rise of DuckDB and its message that big data is dead. The idea comes from two industry papers (and associated datasets): one from the Redshift team (paper and dataset) and one from Snowflake (paper and dataset). Each paper analyzed the queries run on their platform, and some surprising conclusions were drawn, one being that most queries were run over quite small data. DuckDB's conclusion was that big data was dead and that you could use simpler query engines rather than a data warehouse. The reality is far more nuanced than that, but the data does show that most queries are run over smaller datasets.
Why?
On the one hand, many data sets are inherently small, corresponding to things like people, products, marketing campaigns, sales funnel, win/loss rates, etc. On the other hand, there are inherently large data sets (such as clickstreams, logistics events, IoT, sensor data, etc) that are increasingly being processed incrementally.
Why the Trend Towards Incremental Processing?
Incremental processing has a number of advantages:
- It can be cheaper than recomputing the entire derived dataset from scratch, especially if the source data is very large (see the sketch after this list).
- Smaller precomputed datasets can be queried more often without huge costs.
- It can lower the time to insight. Rather than a batch job running on a schedule that balances cost vs timeliness, an incremental job keeps the derived dataset up-to-date so that it’s only minutes or low-hours behind the real world.
- More and more software systems act on the output of analytics jobs. When the output was a report, once a day was enough. When the output feeds into other systems that take actions based on the data, these arbitrary delays caused by periodic batch jobs make less sense.
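To make the cost point concrete, here is a minimal sketch in plain Python (the data, names, and aggregation are illustrative and not tied to any particular platform) contrasting a full recompute of a derived dataset with an incremental update that folds in only the new records:

```python
# Illustrative sketch: full recompute vs. incremental update of a derived
# dataset (revenue per product). All data and names are hypothetical.
from collections import defaultdict

def full_recompute(all_orders: list) -> dict:
    # Batch approach: scan the entire source dataset on every run.
    totals = defaultdict(float)
    for order in all_orders:
        totals[order["product"]] += order["amount"]
    return dict(totals)

def incremental_update(derived: dict, new_orders: list) -> dict:
    # Incremental approach: fold only the records that arrived since the last
    # run into the existing derived dataset, so cost scales with the delta,
    # not with the full history.
    for order in new_orders:
        derived[order["product"]] = derived.get(order["product"], 0.0) + order["amount"]
    return derived

history = [{"product": "a", "amount": 10.0}, {"product": "b", "amount": 5.0}]
derived = full_recompute(history)            # one-off bootstrap over history
delta = [{"product": "a", "amount": 2.5}]    # only the new events since last run
print(incremental_update(derived, delta))    # {'a': 12.5, 'b': 5.0}
```

The incremental path touches only the delta, which is where the cost and latency savings come from.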
Going incremental, while cheaper in many cases, doesn't mean we'll use less compute, though. The Jevons paradox is the economic observation that technological advances which make the use of a resource more efficient often lead to a paradoxical increase in the overall consumption of that resource rather than a decrease. Greater efficiency leads people to believe that we won't use as much of the resource, but in reality, the efficiency gains create more demand and therefore more consumption.
Using the intuition of the Jevons paradox, we can expect the trend of incremental computation to lead to more computing resources being used in analytics rather than fewer.
We can now:
- Run dashboards with faster refresh rates.
- Generate reports sooner.
- Utilize analytical data in more user-facing applications.
- Utilize analytical data to drive actions in other software systems.
As we make lower-latency analytics more cost-efficient, the demand for those workloads will undoubtedly increase, as use cases that weren't economically viable before become viable. The rise of GenAI is another driver of demand (though it is definitely not making analytics cheaper!).
Many data systems and data platforms already support incremental computation:
- Real-time OLAP:
  - ClickHouse, Apache Pinot, and Apache Druid all provide incremental precomputed tables.
- Cloud DWH/lakehouse:
  - Snowflake materialized views.
  - Databricks DLT.
  - dbt incremental jobs.
  - Apache Spark jobs.
  - Incremental capabilities of the open table formats.
  - Incremental ingestion jobs.
- Stream processing:
  - Apache Flink.
  - Spark Structured Streaming (see the sketch after this list).
  - Materialize (a streaming database that maintains materialized views over streams).
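As one concrete illustration, here is a minimal Spark Structured Streaming sketch that incrementally maintains an hourly revenue aggregate from a stream of order events. The topic name, schema, broker address, and checkpoint path are all assumptions made for illustration; a real job would also write to an Iceberg/Delta/Hudi sink rather than the console.

```python
# Minimal sketch of an incremental job with Spark Structured Streaming.
# Assumes a Kafka topic "orders" with JSON payloads (order_id, amount,
# event_time) and that the Spark Kafka connector package is on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("incremental-revenue").getOrCreate()

order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the stream of order events from Kafka and parse the JSON payload.
orders = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # illustrative address
    .option("subscribe", "orders")
    .load()
    .select(F.from_json(F.col("value").cast("string"), order_schema).alias("o"))
    .select("o.*")
)

# Maintain hourly revenue incrementally instead of recomputing it in a batch job.
hourly_revenue = (
    orders
    .withWatermark("event_time", "1 hour")
    .groupBy(F.window("event_time", "1 hour"))
    .agg(F.sum("amount").alias("revenue"))
)

# Continuously update the derived dataset; the checkpoint lets the job resume
# from where it left off rather than starting over.
query = (
    hourly_revenue.writeStream
    .outputMode("update")
    .format("console")   # in practice, an Iceberg/Delta/Hudi table sink
    .option("checkpointLocation", "/tmp/checkpoints/hourly_revenue")
    .start()
)
```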
While the technology for incremental computation is already largely here, many organizations aren't actually ready for the switch from periodic batch to incremental.
The Collision Course
Modern data engineering is emancipating ourselves from an uncontrolled flow of upstream changes that hinders our ability to deliver quality data.
– Julien Le Dem
The collision:
Bad things happen when uncontrolled changes collide with incremental jobs that feed their output back into other software systems or pollute other derived data sets. Reacting to changes is a losing strategy.
– Jack Vanlightly
Many, if not most, organizations are not equipped to realize this future where analytics data drives actions in other software systems and is exposed to users in user-facing applications. A world of incremental jobs raises the stakes on reliability, correctness, uptime (freshness), and the general trustworthiness of data pipelines. The problem is that data pipelines are neither reliable enough nor cost-effective enough (in terms of human resource costs) to meet this incremental computation trend.
We need to rethink the traditional data warehouse architecture where raw data is ingested from across an organization and landed in a set of staging tables to be cleaned up serially and made ready for analysis. As we well know, that leads to constant break-fix work as data sources regularly change, breaking the data pipelines that turn the raw data into valuable insights. That may have been tolerable when analytics was about strategic decision support (like BI), where the difference of a few hours or a day might not be a disaster. But in an age where analytics is becoming relevant in operational systems and powering more and more real-time or low-minute workloads, it is clearly not a robust or effective approach.
The ingest-raw-data->stage->clean->transform approach has a huge amount of inertia and a lot of tooling, but it is becoming less and less suitable as time passes. For analytics to be effective in a world of lower latency incremental processing and more operational use cases, it has to change.
So, What Should We Do Instead?
The barrier to improving data pipeline reliability and enabling more business-critical workloads mostly relates to how we organize teams and the data architectures we design. The technical aspects of the problem are well-known, and long-established engineering principles exist to tackle them.
The thing we’re missing right now is that the very foundations that analytics is built on are not stable. The onus is on the data team to react quickly to changes in upstream applications and databases. This is clearly not going to work for analytics built on incremental jobs where expectations of timeliness are more easily compromised. Even for batch workloads, the constant break-fix work is a drain on resources and also leads to end users questioning the trustworthiness of reports and dashboards.
The current approach of reacting to changes in raw data has come about largely because of Conway’s Law: how the different reporting structures have isolated data teams from the operational estate of applications and services. Without incentives for software and data teams to cooperate, data teams have, for years and years, been breaking one of the cardinal rules for how software systems should communicate. Namely, they reach out to grab the private internal state of applications and services. In the world of software engineering, this is an anti-pattern of epic proportions!
It’s All About "Coupling"
I could make a software architect choke on his or her coffee if I told them my service was directly reading the database of another service owned by a different team.
Why is this such an anti-pattern? Why should it result in spilled coffee and dumbfounded shock? It’s all about coupling. This is a fundamental property of software systems that all software engineering organizations take heed of.
When services depend on the private internal workings of other services, even small changes in one service's internal state can propagate unpredictably, leading to failures in distant systems and services. This is the principle of coupling, and we want low coupling. Low coupling means that we can change individual parts of a system without those changes propagating far and wide. The more coupling you have in a system, the more coordination and work are required to keep all parts of the system working. This is the situation data teams still find themselves in today.
For this reason, software services expose public interfaces (such as a REST API, gRPC, GraphQL, a schematized queue, or a Kafka topic) that are carefully modeled, stable, and evolved with care to avoid breaking changes. A system with many breaking changes has high coupling. In a high-coupling world, every time I change my service, I force all dependent services to update as well. Now we either have to perform costly coordination between teams to update services at the same time, or we get a nasty surprise in production.
That is why in software engineering, we use contracts, and we have versioning schemes such as SemVer to govern contract changes. In fact, we have multiple ways of evolving public interfaces without propagating those changes further than they need to. It’s why services depend on contracts and not private internal state.
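As a minimal sketch of what contract-governed evolution looks like in practice (plain Python with hypothetical event and field names, not any specific schema registry or serialization framework), an additive change with a default is non-breaking, while a rename or removal forces a major version bump:

```python
# Illustrative sketch of backward-compatible contract evolution; the
# "OrderPlaced" event and its fields are hypothetical.
from dataclasses import dataclass
from typing import Optional

# v1.0.0 of the OrderPlaced contract.
@dataclass
class OrderPlacedV1:
    order_id: str
    amount: float

# v1.1.0: an additive, backward-compatible change. The new field is optional
# with a default, so existing consumers are unaffected.
@dataclass
class OrderPlacedV1_1:
    order_id: str
    amount: float
    currency: Optional[str] = "USD"

def consume_v1(payload: dict) -> OrderPlacedV1:
    # A v1 consumer reads only the fields it knows about and tolerates extras,
    # so producers can add fields without coordinating a simultaneous upgrade.
    return OrderPlacedV1(order_id=payload["order_id"], amount=payload["amount"])

# Renaming or removing "amount", by contrast, would break this consumer and,
# under SemVer, would require a new major version (v2.0.0) plus a migration window.
event = {"order_id": "o-123", "amount": 42.0, "currency": "EUR"}  # produced under v1.1.0
print(consume_v1(event))  # still works: OrderPlacedV1(order_id='o-123', amount=42.0)
```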
Not only do teams build software that communicates via stable APIs, but the software teams collaborate to provide those APIs that the various teams require. This need for APIs and collaboration has only become larger over time. The average enterprise application or service used to be a bit of an island: it had its ten database tables and didn't really need much more. Increasingly, these applications are drawing on much richer sets of data and forming much more complex webs of dependencies. Given this web of dependencies between applications and services, (1) the number of consumers of each API has risen, and (2) the chance of some API change breaking a downstream service has also risen massively.
Stable, versioned APIs between collaborating teams are the key.
Data Products (Seriously)
This is where data products come in. Like or loathe the term, it’s important.
Rather than a data pipeline sucking out the private state of an application, it should consume a data product. Data products are very similar to the REST APIs on the software side. They aren’t totally the same, but they share many of the same concerns:
- Schemas. The shape of the data, both in terms of structure (the fields and their types) and legal values, such as not-null constraints or 16-digit credit card numbers (see the validation sketch below).
- Careful evolution of schemas to prevent changes from propagating (we want low coupling). Avoiding breaking changes as much as humanly possible.
- Uptime, which for data products becomes “data freshness.” Is the data arriving on time? Is it late? Perhaps an SLO or even an SLA determines the data freshness goals.
Concretely, data products are consumed as governed data-sharing primitives, such as Kafka topics for streaming data and Iceberg/Hudi tables for tabular data. While the public interface may be a topic or a table, the logic/infra that produces the topic or table may be varied. We really don’t want to just emit events that are mirrors of the private schema of the source database tables (due to the high coupling it causes). Just as REST APIs are not mirrors of the underlying database, the data product also requires some level of abstraction and internal transformation. Gunnar Morling wrote an excellent post on this topic, focused on CDC and how to avoid breaking encapsulation.
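Here is a minimal sketch of that abstraction layer, with entirely hypothetical table and field names: the producing team maps an internal CDC row to the stable public shape of the data product and enforces the contract's rules (not null, 16-digit card numbers) before anything crosses the boundary.

```python
# Illustrative sketch: publish a stable public shape instead of mirroring the
# internal "cust_tbl" row. All names and rules here are hypothetical.
from datetime import datetime, timezone

def to_customer_data_product(cdc_row: dict) -> dict:
    # Internal column names, surrogate keys, and status codes stay private;
    # the public event exposes only the agreed, stable fields.
    event = {
        "customer_id": cdc_row["cust_uuid"],  # stable business key, not the internal PK
        "full_name": f'{cdc_row["fname"]} {cdc_row["lname"]}'.strip(),
        "card_number": cdc_row["cc_num"],
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }
    validate(event)
    return event

def validate(event: dict) -> None:
    # Contract rules: required fields must not be null, card numbers have 16 digits.
    for field in ("customer_id", "full_name", "card_number"):
        if not event.get(field):
            raise ValueError(f"contract violation: {field} must not be null")
    digits = [c for c in event["card_number"] if c.isdigit()]
    if len(digits) != 16:
        raise ValueError("contract violation: card_number must contain 16 digits")

internal_row = {"cust_pk": 991, "cust_uuid": "c-7f3a", "fname": "Ada", "lname": "Lovelace",
                "cc_num": "4111-1111-1111-1111", "status_cd": 3}
print(to_customer_data_product(internal_row))
```

The internal schema can now change (renamed columns, new status codes) without breaking consumers, as long as the mapping keeps producing the agreed public shape.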
These data products should be capable of real-time or close to real-time because downstream consumers may also be real-time or incremental. As incremental computation spreads, it becomes a web of incremental vertices with edges between them: a graph of incremental computation that is spread across the operational and analytical estates. While the vertices and edges are different from the web of software services, the underlying principles for building reliable and robust systems are the same — low coupling architectures based on stable, evolvable contracts.
Because data flows across boundaries, data products should be based on open standards, just as software service contracts are built on HTTP and gRPC. They should come with tooling for schema evolution, access controls, encryption/data masking, data validation rules, etc. More than that, they should come with an expectation of stability and reliability — which comes about from mature engineering discipline and prioritizing these much-needed properties.
These data products are owned by the data producers rather than the data consumers (who have no power to govern application databases). It's not possible for a data team to own a data product whose source is another team's application or database and expect it to be both sustainable and reliable. Again, I could make a software architect choke on their coffee by suggesting that my software team should build and maintain a REST API (one we desperately need) that serves the data of another team's database.
Consumers don’t manage the APIs of source data; it’s the job of the data owner, aka the data producer. This is a hard truth for data analytics but one that is unquestioned in software engineering.
The Challenge Ahead
What I am describing is Shift Left applied to data analytics. Shifting left acknowledges that data analytics can't be a silo where we dump raw data, clean it up, and transform it into something useful. It's been done this way for so long, with multi-hop architectures, that it's really hard to consider something else. But look at how software engineers build a web of software services that consume each other's data (in real time): software teams are doing things very differently.
The most challenging aspect of Shift Left is that it changes roles and responsibilities that are now ingrained in the enterprise. This is just how things have been done for a long time. That’s why I think Shift Left will be a gradual trend as it has to overcome this huge inertia.
The role of data analytics has expanded from reporting alone to include feeding the applications that run the business. Delaying the delivery of a report by a few hours was tolerable, but in operational systems, hours of downtime can mean huge amounts of lost revenue, so the importance of building reliable (low-coupling) systems has increased.
What is holding back analytics right now is that it isn’t reliable enough, it isn’t fast enough, and it has the constant drain of reacting to change (with no control over the timing or shape of those changes). Organizations that shift responsibility for data to the left will build data analytics pipelines that source their data from reliable, stable sources. Rather than sucking in raw data from across the enterprise and dealing with change as it happens, we should build incremental analytics workloads that are robust in the face of changing applications and databases.
Ultimately, it’s about:
- Solving a people problem (getting data and software teams to work together).
- Applying sound engineering practices to create robust, low-coupling data architectures that can be fit for purpose for more business-critical workloads.
The trend of incremental computation is great, but it only raises the stakes.
Published at DZone with permission of Jack Vanlightly. See the original article here.