Is Data Lineage a Pain Killer or Vitamin?
Discover how data lineage is used by organizations, its benefits, and the critical questions to ask before implementation. Learn from real customer insights.
Join the DZone community and get the full member experience.
Join For FreeTL;DR: I might be biased on this, but I’m also equipped with analytics on column-level lineage usage from a number of customers and users.
Is Data Lineage a Pain Killer or Vitamin?
First, it very much depends on the user organization’s current use cases and their level of maturity.
In my humble opinion, data engineers love looking at data flows and have that visual understanding of dependencies, but do they really use data lineage at the end of the day? What is the usage frequency? What are the specific use cases?
From what we observed, data lineage certainly drives interest. However, when it comes to actual usage, it is not the central feature. This could be because our implementation is limited to some data sources. However, having lineage limited to only some pipelines also seems less meaningful to me (i.e., lineage in dbt or Dataform), as ingestion and other processes are left in shades. A typical use case might involve someone in the organization searching for a specific pipeline or model about twice a week for a few minutes.
Common Uses for Data Lineage
The most common use cases for lineage we saw were:
- The company is migrating or rebuilding its data platform.
- The organization is onboarding new teammates, often for new data initiatives.
These are the times when lineage becomes very handy. Basically, it’s when the company starts not just maintaining what is in their data warehouse or data lake, but actually building and modernizing the data ecosystem.
Does this mean that having lineage is a must in this case? Absolutely not. But if you are interested in moving faster and smarter, then the answer is absolutely yes.
Questions To Consider
So, it very much depends on what the organization is currently doing. I am not trying to be assertive here, but rather intelligently honest by asking if you really need data lineage. You might want to start with questions like:
- What is it for?
- What level of coverage do you need?
- Does it need to visualize production sources, or is a data warehouse enough?
- Do you need a BI solution connected? If yes, to what extent?
Then you speak to the universe and decide: buy or build. There’s a lot to consider here. My take is as follows:
- Will it be used by the data team only, or will business users also be involved? (Consider the level of UX/UI required.)
- How much are you ready to invest in it? (Calculate the cost of building it internally at the expense of your team’s hours and compare it to purchasing from a vendor.) Please, do not forget to double the hours your team initially promised to you. Hear me out; I'm speaking as a product manager here.
- Consider what you have already in your data platform: data lake, using third-party data sources, and the stack already in use by the data team. It sounds easy and fun until you start dealing with complex cases like cross-project dependencies, views of temporary tables, or, heaven forbid, sharded tables, etc., and the list goes on.
- What is your team’s strategic focus and their skill set? Is it a strategic investment for you, and do you have the capacity to maintain and evolve it? Because your data platform, whether you believe it or not, will evolve.
Conclusion
Ultimately, my personal belief is that data lineage as a standalone visualization is not effective. Our use case for data lineage is to help troubleshoot broken pipelines or model errors because when organizations have an active warehouse with hundreds of pipelines and thousands of tables, it is impossible to keep track of if everything is working as expected. When we are talking about data quality, those are SQL rules and something already anticipated and known, but pipelines and models are a different beast. It is a lot about connectivity, compatibility, and effectiveness of the data platforms. Pairing data pipeline/model error detection and data lineage is the area where we see a lot of response and value for users. Additionally, it helps our clients save money as it is also connected to cost insights.
Having lineage alone does not solve the problem; it creates a new one. No one understands how the solution is being used because lineage alone does not move the needle. It rather helps to move it faster in combination with anomaly detection and pipeline error detection.
While data lineage alone may be seen as just another shining tool, its true value emerges when paired with comprehensive monitoring mechanisms and a commitment from the organization and the data team to build up a robust and reliable data platform.
Opinions expressed by DZone contributors are their own.
Comments