
WhereHows: An Open Source Lineage and Annotation Tool for Data Collections

LinkedIn's software solves the problem of traceability for large collections of data that are continually transformed by different toolchains.

by Rodrigo Kyle Mehren · Mar. 15, 16 · Big Data Zone · News

Managing the changes made to any given dataset becomes harder as an organization grows, especially at enterprise scale.

WhereHows was recently released as open source by its developer, LinkedIn, which manages vast amounts of data (about 50,000 datasets, 14,000 comments, and 35 million job execution records). The name is a compound of two important attributes of data: "WHERE is the data, and HOW is it produced/consumed."

The Life of Big Data

LinkedIn created its tool to solve a problem that is increasingly common in the Big Data ecosystem: the "life" of a dataset within an organization is complicated. Data is created, imported, transformed, segmented, converted, and otherwise altered by the tools an organization uses.

Figure 1: WhereHows Lineage

Figure 1 is the "lineage" representation of the system. The lineage analogy conceptually organizes the flow of data across an organization.

Figure 2: Architecture

Figure 2 diagrams the system architecture of WhereHows, which draws on external systems including:

  • Hadoop
  • Teradata
  • Azkaban
  • Oozie

From what it gathers from each external system, WhereHows builds a single data model made of four components:

  • Datasets themselves
  • Operational data flow 
  • Lineage data
  • WhereHows ETL and Web UI/Service
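
To make the idea of lineage concrete, here is a minimal sketch of how datasets and job executions could be linked so that a dataset's upstream sources can be traced. It is an illustration only, not WhereHows' actual schema or API: the class names (`Dataset`, `JobExecution`, `Catalog`), the `upstream` method, and the sample dataset and job names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    name: str    # hypothetical identifier, e.g. a file path or table name
    system: str  # the external system that stores it


@dataclass(frozen=True)
class JobExecution:
    job: str        # hypothetical job name from a scheduler
    inputs: tuple   # names of datasets the job read
    outputs: tuple  # names of datasets the job wrote


@dataclass
class Catalog:
    datasets: dict = field(default_factory=dict)
    executions: list = field(default_factory=list)

    def add_dataset(self, dataset):
        self.datasets[dataset.name] = dataset

    def record(self, execution):
        self.executions.append(execution)

    def upstream(self, name):
        """Return every dataset name that `name` was derived from, directly or indirectly."""
        seen = set()
        stack = [name]
        while stack:
            current = stack.pop()
            for execution in self.executions:
                if current in execution.outputs:
                    for source in execution.inputs:
                        if source not in seen:  # guards against cycles
                            seen.add(source)
                            stack.append(source)
        return seen


# Sample data (invented for illustration).
catalog = Catalog()
catalog.add_dataset(Dataset("teradata.events", "Teradata"))
catalog.add_dataset(Dataset("hdfs:/raw/events", "Hadoop"))
catalog.add_dataset(Dataset("hdfs:/derived/sessions", "Hadoop"))
catalog.record(JobExecution("import_events", ("teradata.events",), ("hdfs:/raw/events",)))
catalog.record(JobExecution("sessionize", ("hdfs:/raw/events",), ("hdfs:/derived/sessions",)))

sources = catalog.upstream("hdfs:/derived/sessions")
```

In this sketch, `sources` contains both the intermediate dataset and the original table, which is the kind of "where did this come from" question a lineage model is meant to answer.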

Using WhereHows, even very large organizations can benefit from a standard for storing, tracking, and annotating data about data.


Resources

WhereHows GitHub

WhereHows Wiki
