Data Quality Monitoring — You’re Doing It Wrong

Monitoring your “important” data only gets you so far. If you really want data quality coverage, it's time to go deep and broad with your data monitors.

By Lior Gavish · Oct. 03, 22 · Opinion

Occasionally, we’ll talk with data teams interested in applying data quality monitoring narrowly across only a specific set of key tables. 

The argument goes something like: “You may have hundreds or thousands of tables in your environment, but most of your business value derives from only a few that really matter. That’s where you really want to focus your efforts.”

In today’s world, this approach is no longer sufficient. It’s true - you need to go “deep” on certain data sets, but without also going broad, you’re missing the boat entirely. 

As data environments grow increasingly complex and interconnected, the need to cover more bases, faster, becomes critical to achieving true data reliability. To do this, you need data monitors that drill “deep” into the data using both machine learning and user-defined rules, as well as metadata monitors that scale “broadly” across every production table in your environment, all fully integrated across your stack.

In this post, we’ll explain why - and how - delivering reliable data requires layering multiple types of data monitors across the field, table, enterprise data warehouse, and stack levels. 

Deep Data Monitoring at the Field Level

The first type of “deep” data monitor uses machine learning to generate pre-configured data quality rules that validate the data in your tables. These are best deployed when you know a set of fields in a particular table is important, but you aren’t sure how it will break. Such data monitors are effective at detecting anomalies that occur when individual values in a field are null, duplicated, malformatted, or otherwise broken. They can also identify cases where metrics deviate from their normal patterns.

You don’t need to pre-configure rules and thresholds for these types of monitors – most monitoring solutions will offer suggestions based on historical trends in the data. However, because these monitors query the data directly, they are compute-intensive. That makes them prohibitively expensive to scale across full pipelines and best suited to “going deep” on your high-risk tables.
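To make this concrete, here is a minimal sketch of what such a field-level check might do under the hood, flagging a sudden jump in a field’s null rate against its recent history. The connection object, table, field, and created_at column are hypothetical placeholders, and a real monitoring tool would learn its thresholds from the data rather than hard-code a z-score.

    import statistics

    # Sketch of a field-level "deep" monitor: flag days where a field's null rate
    # deviates sharply from its recent history. conn is a hypothetical DB-API-style
    # connection; the table, field, and created_at column are placeholders.
    def null_rate_is_anomalous(conn, table, field, days=30, z_threshold=3.0):
        rows = conn.execute(f"""
            SELECT DATE(created_at) AS day,
                   AVG(CASE WHEN {field} IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate
            FROM {table}
            WHERE created_at >= CURRENT_DATE - INTERVAL '{days} days'
            GROUP BY 1
            ORDER BY 1
        """).fetchall()
        if len(rows) < 2:
            return False  # not enough history to judge
        history, today = [r[1] for r in rows[:-1]], rows[-1][1]
        stdev = statistics.pstdev(history) or 1e-9  # avoid division by zero
        return (today - statistics.mean(history)) / stdev > z_threshold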

Monte Carlo allows users to select key fields or tables for monitors that query the data directly for dimension tracking, field health, and JSON schema, so they can apply additional, deep monitoring to critical assets while still being efficient with their compute spend.

The second type of monitor, the user-defined monitor, triggers an alert when the data fails specified logic. User-defined monitors are best deployed when you have well-defined, well-understood logic for catching anomalies that are frequent or severe. That logic is typically expressed as a SQL statement.

In other words, these data monitors track data accuracy, data completeness, and general adherence to your business rules. For example, you may have a metric like shipment time that can never be negative or null, or a field in a critical table that can never exceed 100 percent. These monitors can even track whether data stored in one table matches data stored in another table, or whether your metrics are consistent across multiple breakdowns and calculations.
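As a rough illustration, a user-defined monitor can be as simple as a SQL assertion that should return zero rows, plus an alert when it doesn’t. The shipments table, its columns, and the alert callback below are hypothetical.

    # Sketch of a user-defined monitor: the rule is a SQL query that should return
    # zero rows; any rows it does return are violations worth alerting on.
    SHIPMENT_TIME_RULE = """
        SELECT order_id, shipment_time
        FROM shipments
        WHERE shipment_time IS NULL OR shipment_time < 0
    """

    def run_rule(conn, rule_sql, alert):
        violations = conn.execute(rule_sql).fetchall()
        if violations:
            alert(f"{len(violations)} rows failed the shipment_time rule")
        return violations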


Deep Data Monitoring at the Table Level

At the table level, ML-based data monitors can detect anomalies like freshness (did that table update when it should have?) and volume (are there too many or too few rows?). 

But again, this approach is too costly to deploy at scale across all of your pipelines, making these monitors best reserved for a few critical assets. For example, your CEO may look at a specific dashboard every morning at 8:00 am EST, and heads will roll if the data feeding that report hasn’t been updated by then.
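Here is a minimal sketch of a table-level freshness check, assuming a hypothetical connection and an updated_at timestamp column; a volume check follows the same shape with COUNT(*) in place of MAX().

    from datetime import datetime, timedelta, timezone

    # Sketch of a table-level freshness monitor: alert if the table has not received
    # new rows within its expected update window. Names are placeholders, and the
    # comparison assumes the warehouse returns timezone-aware UTC timestamps.
    def is_stale(conn, table, ts_column="updated_at", max_age=timedelta(hours=24)):
        (last_update,) = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()
        return last_update is None or datetime.now(timezone.utc) - last_update > max_age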


The Limitations of Narrow and Deep Data Monitoring

With narrow monitoring, you rob yourself of valuable context from upstream and downstream sources.

When you only deploy user-defined and machine learning monitors on your most important tables: 

  • You miss, or are delayed in detecting and resolving, anomalies that are more evident upstream. 
  • Alerting on a key table, oftentimes dozens of processing steps removed from the root cause, will involve the wrong person and give them little context on the source of the issue or how to solve it.
  • Without understanding the dependencies in the system, your team will waste time trying to determine where to focus their monitoring attention. Maintaining that as the environment and data consumption patterns change can also be tedious and error-prone.

Broad Metadata Monitoring Across the Enterprise Data Warehouse 

“Broad” metadata monitors can cost-effectively scale across every table in your environment and are very effective at detecting freshness and volume anomalies as well as anomalies resulting from schema changes (did the way the data is organized change?).
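The key difference from the deep monitors above is that these checks read warehouse metadata instead of scanning the data itself, so a single cheap query can cover every table. Here is a minimal sketch, assuming Snowflake-style information_schema columns (row_count, last_altered); other warehouses expose similar metadata under different view and column names.

    # Sketch of "broad" metadata monitoring: snapshot freshness and row counts for
    # every production table from the warehouse's metadata views, without touching
    # the underlying data. Column names assume a Snowflake-style information_schema.
    METADATA_QUERY = """
        SELECT table_schema, table_name, row_count, last_altered
        FROM information_schema.tables
        WHERE table_type = 'BASE TABLE'
    """

    def snapshot_metadata(conn):
        return {
            (schema, name): {"rows": rows, "last_altered": altered}
            for schema, name, rows, altered in conn.execute(METADATA_QUERY).fetchall()
        }

Comparing successive snapshots like this over time is what lets freshness, volume, and schema anomalies surface across the whole warehouse at low cost.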


By casting a wider net, you’re not only catching all impactful anomalies, you are also reducing time to detection and resolution by catching them closer to the source of an incident. Tracing anomalies back several layers (and people) upstream, conducting the necessary backfills, and otherwise resolving the incident is expensive in both time and resources.

The more upstream in the stack you monitor for data quality, the faster the time to detection.

Well-trained models, dynamic alert grouping, and granular alert routing ensure alert fatigue doesn’t become an issue.

End-To-End Integration and Log Analysis Across the Stack 

Broad coverage also entails end-to-end integration. Without analyzing the logs across each component of your stack, you are blind to how changes in one system may be causing anomalies in another. 

On an assembly line, Tesla doesn’t only put quality controls in place when the Model X is ready to ship to the customer. They check the battery, engine components, and the body at each step along the way. 

Shouldn’t we check our data pipelines at each point of failure as well? This includes automatic analysis of the pipeline’s SQL logs to understand any recent changes in the code and logic; Airflow logs to understand failures and delays; and dbt logs to understand errors, delays, and test failures.
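dbt is a convenient place to start, because it writes run and test outcomes to a run_results.json artifact. Here is a minimal sketch of pulling test failures out of it; the path and field names follow dbt’s documented artifact schema, but verify them against your dbt version.

    import json

    # Sketch: surface failed or errored dbt tests from the run_results.json artifact.
    def failed_dbt_tests(path="target/run_results.json"):
        with open(path) as f:
            results = json.load(f)["results"]
        return [r["unique_id"] for r in results if r.get("status") in ("fail", "error")]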


And the only way to do this cost-effectively and at scale is by analyzing logs across the stack. This approach both covers all your points of failure and provides additional context to help you prevent future anomalies.

For example, analyzing logs at the BI layer can identify downstream users and data consumption patterns, helping you deploy your deep monitors more effectively. Analyzing logs within the enterprise data warehouse can automatically surface key components of data health, such as deteriorating queries or disconnected tables. 
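As one illustration of the warehouse side, here is a sketch that flags deteriorating queries from query logs. It assumes Snowflake’s account_usage.query_history view and its total_elapsed_time column (in milliseconds); any warehouse that exposes query history supports the same idea.

    # Sketch: find queries that have averaged over a minute during the past week.
    SLOW_QUERIES = """
        SELECT query_text,
               AVG(total_elapsed_time) / 1000 AS avg_seconds,
               COUNT(*) AS runs
        FROM snowflake.account_usage.query_history
        WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
        GROUP BY query_text
        HAVING AVG(total_elapsed_time) > 60000
        ORDER BY avg_seconds DESC
    """

    def deteriorating_queries(conn):
        return conn.execute(SLOW_QUERIES).fetchall()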


The Future of Monitoring Is End-To-End Data Observability

At the end of the day (or rather, the end of the pipeline), pairing deep monitoring with broad coverage gives you the best of both worlds: end-to-end visibility into your data environment and the tools necessary to quickly address any issues that do arise.

Like our software engineering and DevOps counterparts, we think it’s time to move past monitoring a narrow set of metrics and embrace an approach that more actively identifies, resolves, and debugs issues in complex technical systems. Maybe it’s time we embraced data observability instead. 


Published at DZone with permission of Lior Gavish. See the original article here.

Opinions expressed by DZone contributors are their own.
