
WhereHows: An Open Source Lineage and Annotation Tool for Data Collections


LinkedIn's software solves the problem of traceability for large collections of data that are continually transformed by different toolchains.


Managing the changes made to any given dataset becomes harder as an organization grows, especially at enterprise scale.

WhereHows was recently released as open source by its developer, LinkedIn, which manages vast amounts of data (about 50,000 datasets, 14,000 comments, and 35 million job execution records). The name is a compound of two important attributes of data: "WHERE is the data, and HOW is it produced/consumed."

The Life of Big Data

LinkedIn created its tool to solve a problem that is increasingly common in the Big Data ecosystem: the "life" of a dataset within an organization is complicated. Data is created, imported, transformed, segmented, converted, and otherwise altered by the tools an organization uses.

Figure 1: WhereHows Lineage

Figure 1, above, is the "lineage" representation of the system. The analogy conceptually organizes the flow of data across an organization.
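
The lineage idea can be illustrated with a small sketch. This is not WhereHows code or its schema: the class `LineageGraph`, its methods, and the dataset and job names are all hypothetical, chosen only to show how recording "which job read which datasets to write which dataset" answers the WHERE and HOW questions.

```python
from collections import defaultdict


class LineageGraph:
    """Toy lineage store (hypothetical, not the WhereHows data model)."""

    def __init__(self):
        # output dataset -> list of (job name, input datasets) that produced it
        self._producers = defaultdict(list)

    def record(self, job, inputs, output):
        """Note that `job` read `inputs` and wrote `output`."""
        self._producers[output].append((job, tuple(inputs)))

    def how(self, dataset):
        """Return the jobs that directly produced `dataset`."""
        return [job for job, _inputs in self._producers.get(dataset, [])]

    def upstream(self, dataset):
        """Return every dataset that `dataset` was derived from,
        directly or through intermediate datasets."""
        seen = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            for _job, inputs in self._producers.get(current, []):
                for source in inputs:
                    if source not in seen:
                        seen.add(source)
                        stack.append(source)
        return seen


# Made-up example: two jobs feeding one derived dataset.
graph = LineageGraph()
graph.record("import_job", ["raw_events"], "events_clean")
graph.record("join_job", ["events_clean", "members"], "member_activity")

print(graph.how("member_activity"))
print(sorted(graph.upstream("member_activity")))
```

A real deployment would populate such a graph from job execution records rather than by hand; the sketch only shows why recording inputs and outputs per job is enough to trace a dataset back to its sources.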

Figure 2: Architecture

Figure 2 diagrams the system architecture of WhereHows.

From the inputs it gathers from each external system, WhereHows builds a single data model made of four components:

  • Datasets themselves
  • Operational data flow 
  • Lineage data
  • WhereHows ETL and Web UI/Service

Using WhereHows, even very large organizations can benefit from a standard for storing, tracking, and annotating data about data.


Resources

WhereHows GitHub

WhereHows Wiki


Topics:
hadoop, teradata, oozie

Opinions expressed by DZone contributors are their own.
