Why Are We Treating Data Like a Picasso?
Provenance, Lineage, & Chain of Custody
The models of Provenance, Lineage, and Chain of Custody are used in fine art to determine when a piece was created, the sequence of locations where it was held, how it was touched along the way, and who has owned it since creation, all with the purpose of authenticating the piece. What does this have to do with boring data?
It turns out that many decisions affecting our daily lives are made using a single final result, or score, which is derived from many other pieces of data. What if one of those pieces of data was wrong or stale? This could lead to “Bad Data”, and the consequences can range from the inconvenient to the catastrophic. We must understand the data components used to calculate a final number to ensure the result is valid and current. This is why we need to adopt the models of Data Provenance, Data Lineage, and Data Chain of Custody, and make them an intrinsic part of any data-driven decision.
Let me start with a few examples:
- In September 2008, a report flashed across trading screens saying United Airlines had filed for bankruptcy. This provoked investor panic and sent UAL stock plummeting more than 75%. The (undated) article, which appeared on the list of top news stories from Google News, turned out to be six years old: it concerned the 2002 bankruptcy of UAL, United Airlines’ parent company.
- In May 1999, five US JDAM guided bombs hit the Chinese embassy in Belgrade during the NATO bombing of Yugoslavia. George Tenet, director of the CIA, attributed this to a mistake caused by three basic failures: first, they had the wrong coordinates; second, none of the military databases used to validate the targets contained the correct information; third, nowhere in the target review process was either of the first two mistakes detected.
- Bad data has been widely accepted as a major factor in the Financial Crisis of 2008. Saul Hansell, writing in the New York Times Bits blog on September 18, 2008 (“How Wall Street Lied to Its Computers”), says: “The people who ran the financial firms chose to program their risk-management systems with overly optimistic assumptions and to feed them oversimplified data.” Financial regulators seem to agree and have passed a series of regulations related to data as well (see The Tortoise and the Hare in Wall Street).
Estimates of the cost of “Bad Data” range from the TDWI (The Data Warehousing Institute) figure of $611 billion each year for U.S. firms to IBM’s figure of $3.1 trillion per year. Either figure is simply staggering, not to mention the individual lives affected.
The causes of Bad Data typically fall into these categories:
- Bad Source: Data sourced from the wrong place, or entered incorrectly
- Undocumented Alteration: Data which is altered along the way and not documented
- Wrong Use: Data modified for a specific purpose which does not fit other uses
- Stale: Data that is outdated
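Two of these categories, an unknown source and stale or undated data, can be checked mechanically before a score is calculated. The following is only a minimal sketch: `ScoreInput`, `find_input_problems`, and the `max_age` threshold are hypothetical names invented for illustration, and the check does nothing about undocumented alteration or wrong use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ScoreInput:
    """One piece of data feeding a final score (hypothetical structure)."""
    name: str
    value: float
    source: Optional[str]        # where the value came from; None if unknown
    as_of: Optional[datetime]    # when the value was valid; None if undated


def find_input_problems(
    inputs: List[ScoreInput], now: datetime, max_age: timedelta
) -> List[str]:
    """Return a human-readable list of inputs that should not be trusted."""
    issues = []
    for item in inputs:
        if item.source is None:
            issues.append(f"{item.name}: unknown source")
        if item.as_of is None:
            # An undated input cannot be shown to be current.
            issues.append(f"{item.name}: undated")
        elif now - item.as_of > max_age:
            issues.append(f"{item.name}: stale")
    return issues
```

An undated input is reported separately from a stale one on purpose: as the United Airlines example shows, a missing date is exactly what lets old data pass as new.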
The right solution needs to address all these issues under the umbrella of Data Governance, and it must provide a full audit trail to record and verify all events that could change any piece of data going into a meaningful calculation. It must enable enterprises to have the proper tracking and monitoring of data via Data Provenance, Data Lineage, and Data Chain of Custody.
Data Provenance refers to the “origin” and “source” of data: where a piece of data came from and the process by which it came to be in its present state.
Data Lineage is the process of tracing and recording the origins of data and its movement between databases or systems; it tracks the data life cycle from its origin to its destination over time, and what happens as it goes through diverse processes.
Chain of Custody refers to the indelible record that captures the original data, who accessed or modified it during its lifetime, how the data changed, and where and when possession was transferred.
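One common way to make such a record hard to rewrite quietly is to chain the entries together with hashes, so that changing an old entry breaks every link after it. The sketch below illustrates that general technique only; it is not how any particular product implements chain of custody, and `CustodyEntry` and `CustodyLog` are hypothetical names. It makes tampering evident rather than impossible, and it assumes the latest hash (`head`) is also kept somewhere an attacker cannot reach.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CustodyEntry:
    actor: str        # who touched the data
    action: str       # e.g. "created", "modified", "transferred"
    timestamp: str    # ISO-8601 string supplied by the caller
    data_hash: str    # SHA-256 of the data after this action
    prev_hash: str    # hash of the previous entry ("" for the first entry)

    def entry_hash(self) -> str:
        """Hash every field, including the link to the previous entry."""
        payload = json.dumps(
            [self.actor, self.action, self.timestamp, self.data_hash, self.prev_hash]
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CustodyLog:
    """Append-only log in which each entry commits to the one before it."""

    def __init__(self) -> None:
        self.entries: List[CustodyEntry] = []
        self.head = ""  # hash of the most recent entry

    def record(self, actor: str, action: str, timestamp: str, data: bytes) -> CustodyEntry:
        entry = CustodyEntry(
            actor, action, timestamp, hashlib.sha256(data).hexdigest(), self.head
        )
        self.entries.append(entry)
        self.head = entry.entry_hash()
        return entry

    def verify(self) -> bool:
        """Recompute the chain; any edited entry breaks a later link or the head."""
        prev = ""
        for entry in self.entries:
            if entry.prev_hash != prev:
                return False
            prev = entry.entry_hash()
        return prev == self.head
```

Each entry records who, what, and when, plus a fingerprint of the data after the action, which covers the elements of the definition above without storing the data itself in the log.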
Data provenance needs to allow the user to see how a piece of data flowed through the system, replay it at any stage in the flow, and store what happened to the data before and after key stages. This makes it easier to work with data flows that are often large, complex directed graphs involving transformations, forks, and joins.
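To make the directed-graph idea concrete, here is a minimal sketch of a lineage store that records each processing step with its inputs and outputs, then walks backwards from a result to everything it was derived from. `LineageGraph`, `record`, `upstream`, and `origins` are hypothetical names chosen for illustration; a real system would also capture timestamps, content snapshots, and replay information.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class LineageGraph:
    """Directed graph in which each item points back to the items it came from."""

    def __init__(self) -> None:
        self._parents: Dict[str, Set[str]] = defaultdict(set)
        self._step: Dict[str, str] = {}  # item -> step that produced it

    def record(self, step: str, inputs: Iterable[str], outputs: Iterable[str]) -> None:
        """Register one step; several outputs model a fork, several inputs a join."""
        inputs = list(inputs)
        for out in outputs:
            self._step[out] = step
            self._parents[out].update(inputs)

    def upstream(self, item: str) -> Set[str]:
        """Every item that directly or indirectly contributed to `item`."""
        seen: Set[str] = set()
        stack = [item]
        while stack:
            current = stack.pop()
            for parent in self._parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def origins(self, item: str) -> Set[str]:
        """The items with no recorded parents: the original sources."""
        candidates = self.upstream(item) | {item}
        return {c for c in candidates if not self._parents.get(c)}


graph = LineageGraph()
graph.record("fork", ["raw"], ["a", "b"])               # one input, two outputs
graph.record("enrich", ["a"], ["a2"])                   # a transformation
graph.record("join", ["a2", "b", "ref"], ["score"])     # three inputs, one output
```

With this flow recorded, `graph.origins("score")` traces the final score back to `raw` and `ref`, the two sources whose accuracy and freshness determine whether the score can be trusted.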
A great solution to this problem came from an unexpected source: the National Security Agency (NSA). The NSA could not find a commercial solution that had at its core the data governance, security, and audit capabilities it needed to move massive amounts of data securely from a multitude of sources, so it decided to build one in-house more than ten years ago. The project is called Apache NiFi, and it was submitted to the Apache Software Foundation in November 2014 as part of the NSA Technology Transfer Program, making it an open source software project.
Apache NiFi was implemented to solve two basic problems: first, to move massive amounts of data, from many sources and of many varieties, securely and effectively; and second, to have Data Governance built directly into the system to trace the data from beginning to end.
Every piece of data that flows through Apache NiFi is listed for chain of custody, lineage and data provenance analysis.
Once a piece of data is chosen, its lineage can be inspected further. The picture shows how a piece of data was received, forked and routed between systems.
By further inspecting this flow, one can gain provenance information on how that piece of data was handled and processed along the way. This means full knowledge of where the data came from, who modified it along the way, and how each reported number is calculated.
Data is only reliable when the sources and processes used to create a result set are traceable, reproducible, and visible to those responsible for the results. This requires an infrastructure designed from the ground up to implement proper Data Governance, to track data through all its transformations from source to end result, and to guarantee the integrity and reliability of the provenance records so that they cannot be rewritten.
To quote Tim Berners-Lee: “Data is a precious thing and will last longer than the systems themselves.”
Published at DZone with permission of Diego Baez, DZone MVB. See the original article here.