WhereHows: An Open Source Lineage and Annotation Tool for Data Collections
LinkedIn's software solves the problem of traceability for large collections of data that are continually transformed by different toolchains.
Managing the changes made to any given dataset becomes a challenge as an organization grows, especially at enterprise scale.
WhereHows was recently released as open source by its developer, LinkedIn, which manages vast amounts of data (about 50,000 datasets, 14,000 comments, and 35 million job execution records). The name is a compound of two important attributes of data: "WHERE is the data, and HOW is it produced/consumed."
The Life of Big Data
LinkedIn created its tool to solve a problem that is increasingly common in the Big Data ecosystem: the "life" of a dataset within an organization is complicated. Data is created, imported, transformed, segmented, converted, and otherwise altered by the tools an organization uses.
Figure 1, above, is the "lineage" representation of the system. The analogy conceptually organizes the flow of data across an organization.
Figure 2, right, diagrams the system architecture of WhereHows:
From the information contributed by each external system, WhereHows builds a single data model made of four components:
- Datasets themselves
- Operational data flow
- Lineage data
- WhereHows ETL and Web UI/Service
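To make the idea of lineage data concrete, here is a minimal sketch in Python of how job executions can be linked to the datasets they read and write, and how that record answers the question "where did this dataset come from?" It is an illustration only, not WhereHows's actual schema or API: the names `JobExecution`, `LineageGraph`, `record`, and `trace`, as well as the sample dataset and job names, are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JobExecution:
    """One run of a job (hypothetical model): which datasets it read and wrote."""
    job: str
    inputs: tuple
    outputs: tuple


class LineageGraph:
    """Toy lineage store: maps each dataset to the (job, source) pairs that produced it."""

    def __init__(self):
        self._upstream = defaultdict(set)

    def record(self, execution):
        # Every output of a job run depends on every input of that run.
        for out in execution.outputs:
            for src in execution.inputs:
                self._upstream[out].add((execution.job, src))

    def trace(self, dataset):
        """Return every dataset that feeds into `dataset`, directly or indirectly."""
        seen = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            # .get avoids creating empty entries for datasets with no known source.
            for _job, src in self._upstream.get(current, ()):
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen


graph = LineageGraph()
graph.record(JobExecution("import", ("raw_events",), ("clean_events",)))
graph.record(JobExecution("aggregate", ("clean_events", "members"), ("daily_report",)))

print(sorted(graph.trace("daily_report")))
# prints ['clean_events', 'members', 'raw_events']
```

In this sketch, the operational data flow is the list of recorded job executions, and the lineage is what falls out of connecting their inputs and outputs. A real system would also need to persist these records and attach annotations to the datasets.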
Using WhereHows, even very large organizations can benefit from a standard for storing, tracking, and annotating data about data.