Apache Hadoop® exists within a broader ecosystem of enterprise analytical packages. This includes ETL tools, ERP and CRM systems, enterprise data warehouses, data marts and others. Modern workloads flow from these various traditional analytical sources into Hadoop and then often back out again.
Which dataset came from which system, when did it arrive, and how did it change over time? The ability to answer these questions, manage the chain of custody and track cross-component lineage of specific data is critical for meeting enterprise compliance and audit requirements.
How can these governance objectives for cross-component lineage be achieved within Apache Hadoop? While some commercial solutions show data lineage for a few Hadoop components, they require all workflows to run exclusively through a narrow, proprietary tool set. Furthermore, to maintain compliance standards, these solutions require that all other tools and access methods be prohibited. We believe that this approach is neither open nor collaborative.
Modern, data-driven companies require an open and comprehensive approach to data governance because they use multiple components within the Hadoop ecosystem to address their varied analytic requirements. As an example, users can stream data through Apache Kafka into Apache Storm and store that data in HDFS. Users can also import data from traditional databases into an Apache Hive table using Apache Sqoop. This data can then be stored as a file in HDFS, or replicated and moved to a cloud location using Apache Falcon. Users need the ability to tie together data lineage across all these Hadoop components and have a unified view of how the data was created, processed and moved.
Common cross-component lineage use cases for Apache Hadoop include:
- Operational: Impact analysis, which is a critical requirement for multi-tenant Data Lakes
- Compliance: Chain of custody for audit and compliance reporting, so the data landscape can be reconstructed at any point in time
- Analytical: Privacy requirements for an aggregated dataset. Lineage helps answer acceptable-use questions about the data behind it
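To make the operational and compliance use cases concrete, lineage can be pictured as a directed graph in which datasets are nodes and each edge records the component that moved the data. The sketch below is only a toy model of that idea, not how Atlas stores lineage internally; the class, method names and dataset identifiers are all hypothetical. Walking downstream answers the impact analysis question, and walking upstream reconstructs the chain of custody.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage graph: edges point from a source dataset to a derived one."""

    def __init__(self):
        self.downstream = defaultdict(set)  # source -> {(target, component)}
        self.upstream = defaultdict(set)    # target -> {(source, component)}

    def add_flow(self, source, target, via):
        # 'via' names the component that moved the data (illustrative label).
        self.downstream[source].add((target, via))
        self.upstream[target].add((source, via))

    @staticmethod
    def _walk(start, edges):
        # Breadth-first traversal collecting every dataset reachable from start.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for neighbour, _via in edges.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        return seen

    def impact(self, dataset):
        """Datasets affected if 'dataset' changes (operational use case)."""
        return self._walk(dataset, self.downstream)

    def provenance(self, dataset):
        """Datasets that 'dataset' was derived from (compliance use case)."""
        return self._walk(dataset, self.upstream)


def build_example():
    # Hypothetical dataset names mirroring the flows described in this post.
    g = LineageGraph()
    g.add_flow("kafka:clicks", "hdfs:/data/clicks", "Apache Storm")
    g.add_flow("mysql:crm.customers", "hive:default.customers", "Apache Sqoop")
    g.add_flow("hive:default.customers", "hdfs:/warehouse/customers", "Apache Hive")
    g.add_flow("hdfs:/warehouse/customers", "cloud:backup/customers", "Apache Falcon")
    return g


if __name__ == "__main__":
    graph = build_example()
    print(sorted(graph.impact("mysql:crm.customers")))
    print(sorted(graph.provenance("cloud:backup/customers")))
```

In this toy model, a change to the source database table is seen to affect the Hive table, its HDFS file and the cloud replica, even though three different components produced those hops.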
With the upcoming HDP 2.5 summer release, Apache Atlas will provide the ability for users to easily track lineage across different Hadoop ecosystem components. The functionality is currently available as a technical preview, as part of the standalone Atlas-Ranger VM.
Cross Component Lineage in Atlas
Apache Atlas now provides a consolidated view of lineage across multiple Hadoop components. The Atlas community decided to take a gradual approach to delivering comprehensive interoperability. Apache Hive was chosen as the starting point for this journey because of its maturity, its existing footprint among current Hadoop users and its conceptual similarity to the enterprise data warehouse technologies that are already subject to data governance challenges.
With the upcoming release of Apache Atlas, the community has significantly expanded the ability to track data lineage across other Hadoop components. In addition to Hive, Atlas can now manage lineage for Apache Falcon, which handles the data lifecycle, such as data replication or eviction tasks that run at predefined intervals.
Atlas also supports Apache Kafka and Apache Storm. If a user ingests data from Kafka using a Storm topology, that data will now be tracked by Atlas. The same is true of any data that is moved using Apache Sqoop or any connector that operates on top of Sqoop, such as the connector for Teradata.
With Atlas, developers also have the flexibility to report their own custom activity. For example, if an organization runs an enterprise scheduler outside the Hadoop environment, the scheduler can write directly to the Atlas REST API and add its own steps to the lineage, so that continuity is maintained.
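As a rough illustration of what such a custom report might look like, the sketch below builds a lineage entry for an external scheduler job and prepares an HTTP request for it. Several details are assumptions to check against your Atlas version: the `/api/atlas/v2/entity` endpoint comes from later Atlas releases and may differ in the technical preview, the host and port are the usual defaults, and the job and table names are invented. Authentication is omitted, and the request is built but not sent.

```python
import json
import urllib.request

ATLAS_URL = "http://localhost:21000"       # assumption: default Atlas host and port
ENTITY_ENDPOINT = "/api/atlas/v2/entity"   # assumption: v2 entity API of later releases


def ref(type_name, qualified_name):
    """Reference an existing Atlas entity by its unique qualifiedName."""
    return {
        "typeName": type_name,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def build_process_entity(name, qualified_name, inputs, outputs):
    """Describe an external job as a Process linking input and output datasets."""
    return {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "name": name,
                "qualifiedName": qualified_name,
                "inputs": [ref(t, q) for t, q in inputs],
                "outputs": [ref(t, q) for t, q in outputs],
            },
        }
    }


def build_request(payload, base_url=ATLAS_URL):
    """Prepare (but do not send) the POST request carrying the lineage entry."""
    return urllib.request.Request(
        base_url + ENTITY_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical scheduler job that derives one Hive table from another.
    payload = build_process_entity(
        name="nightly_customer_rollup",
        qualified_name="scheduler.nightly_customer_rollup@cluster1",
        inputs=[("hive_table", "default.customers@cluster1")],
        outputs=[("hive_table", "default.customer_rollup@cluster1")],
    )
    request = build_request(payload)
    print(request.get_method(), request.full_url)
    print(json.dumps(payload, indent=2))
    # Sending would be urllib.request.urlopen(request), plus credentials.
```

Keeping payload construction separate from the HTTP call makes it possible to inspect or log the lineage entry before anything is written to Atlas.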
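Users of the upcoming HDP 2.5 release will be able to track lineage with Atlas across the components described above: Apache Hive, Apache Falcon, Apache Kafka, Apache Storm and Apache Sqoop, along with custom sources that report through the REST API.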
Open Comprehensive Coverage
Apache Atlas now offers comprehensive cross-component data lineage coverage. Additionally, this lineage tracking is done at the data access layer, so any analytic tool at the application tier that uses the underlying components can continue to be used.
The reporting connectors are built into the applicable projects. Users simply need to turn on this functionality using Apache Ambari or a CLI command. Because the integration ships with each project, it is hardened and tested in a way that an add-on application is not.
Enterprises can now manage data lineage across a complex ecosystem of Hadoop and custom components using Atlas. Stay tuned as we plan to continue to broaden the scope of the components whose lineage can be tracked using Atlas.
Interested in a test drive?
The Atlas-Ranger integration and cross-component lineage for Hadoop are available as a public preview, in the form of a packaged VM image. The VM is available for download, and the Cross Component Lineage tutorial walks through the new Atlas features in the VM.