Is Data Lineage a Pain Killer or Vitamin?

Discover how data lineage is used by organizations, its benefits, and the critical questions to ask before implementation. Learn from real customer insights.

By Yuliia Tkachova · May. 21, 24 · Opinion

TL;DR: I might be biased on this, but I’m also equipped with analytics on column-level lineage usage from a number of customers and users.

Image: Data Lineage (courtesy of the Masthead Data team)

First, it very much depends on the user organization’s current use cases and their level of maturity.

In my humble opinion, data engineers love looking at data flows and having a visual understanding of dependencies, but do they really use data lineage at the end of the day? What is the usage frequency? What are the specific use cases?

From what we observed, data lineage certainly drives interest. However, when it comes to actual usage, it is not the central feature. This could be because our implementation is limited to some data sources. Then again, lineage that covers only some pipelines (e.g., lineage in dbt or Dataform) also seems less meaningful to me, as ingestion and other processes are left in the shadows. A typical use case might involve someone in the organization searching for a specific pipeline or model about twice a week for a few minutes.

Common Uses for Data Lineage

The most common use cases for lineage we saw were:

  • The company is migrating or rebuilding its data platform.
  • The organization is onboarding new teammates, often for new data initiatives.

These are the times when lineage becomes very handy. Basically, it’s when the company starts not just maintaining what is in their data warehouse or data lake, but actually building and modernizing the data ecosystem.

Does this mean that having lineage is a must in this case? Absolutely not. But if you are interested in moving faster and smarter, then the answer is absolutely yes.

Questions To Consider

So, it very much depends on what the organization is currently doing. I am not trying to be assertive here, but rather intelligently honest by asking if you really need data lineage. You might want to start with questions like: 

  1. What is it for?
  2. What level of coverage do you need?
  3. Does it need to visualize production sources, or is a data warehouse enough?
  4. Do you need a BI solution connected? If yes, to what extent?

Then you speak to the universe and decide: buy or build. There’s a lot to consider here. My take is as follows:

  • Will it be used by the data team only, or will business users also be involved? (Consider the level of UX/UI required.)
  • How much are you ready to invest in it? (Calculate the cost of building it internally at the expense of your team’s hours and compare it to purchasing from a vendor.) Please do not forget to double the hours your team initially promised you. Hear me out; I'm speaking as a product manager here.
  • Consider what you already have in your data platform: a data lake, third-party data sources, and the stack already in use by the data team. It sounds easy and fun until you start dealing with complex cases like cross-project dependencies, views of temporary tables, or, heaven forbid, sharded tables, and the list goes on.
  • What is your team’s strategic focus and their skill set? Is it a strategic investment for you, and do you have the capacity to maintain and evolve it? Because your data platform, whether you believe it or not, will evolve.
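
The build-versus-buy arithmetic from the list above can be sketched in a few lines of Python. Everything here is a hypothetical placeholder: the hourly rate, the hour estimate, the vendor price, and the function names are assumptions for illustration, not figures from any customer. The only rule taken from the text is doubling the team's initial estimate.

```python
def build_cost(estimated_hours: float, hourly_rate: float, padding: float = 2.0) -> float:
    """Internal build cost, with the team's estimate padded (doubled by default)."""
    return estimated_hours * padding * hourly_rate


def cheaper_option(estimated_hours: float, hourly_rate: float, vendor_price: float) -> str:
    """Return 'build' or 'buy', whichever costs less for the same scope."""
    return "build" if build_cost(estimated_hours, hourly_rate) < vendor_price else "buy"


# Hypothetical inputs: 400 promised hours at 80 per hour vs. a 50,000 vendor contract.
print(build_cost(400, 80))               # 64000.0
print(cheaper_option(400, 80, 50_000))   # buy
```

This sketch deliberately ignores ongoing maintenance, which is likely to matter as the data platform evolves, so treat the result as a lower bound on the cost of building.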

Conclusion

Ultimately, my personal belief is that data lineage as a standalone visualization is not effective. Our use case for data lineage is to help troubleshoot broken pipelines or model errors, because when organizations have an active warehouse with hundreds of pipelines and thousands of tables, it is impossible to keep track of whether everything is working as expected. Data quality is largely a matter of SQL rules, which check for things that are already anticipated and known, but pipelines and models are a different beast. They are a lot about the connectivity, compatibility, and effectiveness of the data platform. Pairing data pipeline/model error detection with data lineage is the area where we see a lot of response and value for users. Additionally, it helps our clients save money, as it is also connected to cost insights.
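
To make the pairing of error detection and lineage concrete, here is a minimal sketch in Python. It assumes lineage is available as a simple table-level mapping from each table to the tables built from it; the table names, the `LINEAGE` dictionary, and the `downstream` function are all hypothetical and do not describe any specific product's implementation. Given a table where a pipeline error was detected, it walks the graph to list everything that may be affected.

```python
from collections import deque

# Hypothetical table-level lineage: each table maps to the tables built from it.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["marts.revenue", "marts.customers"],
    "marts.revenue": ["bi.revenue_dashboard"],
    "marts.customers": [],
    "bi.revenue_dashboard": [],
}


def downstream(lineage: dict[str, list[str]], failed: str) -> list[str]:
    """Breadth-first walk from a failed table to everything that depends on it."""
    seen = {failed}
    queue = deque([failed])
    affected = []
    while queue:
        table = queue.popleft()
        for child in lineage.get(table, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


print(downstream(LINEAGE, "staging.orders"))
# ['marts.revenue', 'marts.customers', 'bi.revenue_dashboard']
```

The point of the sketch is that lineage becomes useful once it has a starting node supplied by something else, here a detected pipeline failure, rather than being browsed on its own.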

Having lineage alone does not solve the problem; it creates a new one: no one understands how the solution is being used, because lineage alone does not move the needle. Rather, it helps move the needle faster in combination with anomaly detection and pipeline error detection.

While data lineage alone may be seen as just another shiny tool, its true value emerges when paired with comprehensive monitoring mechanisms and a commitment from the organization and the data team to build a robust and reliable data platform.
