Data Quality, Lineage, and Cataloging
Once a data lake becomes the standard source for your data, new challenges appear, including how to give users access to it with data quality, lineage, and a searchable catalog in place.
When an organization has mastered automated data ingest and the appropriate application of metadata, a number of additional concerns arise when using data at scale. These include data quality, data lineage, and a searchable data catalog. All of these are factors in presenting an effective and useful data catalog, and the data catalog is the foundation of the self-service capability for a business-facing data presentation and transformation layer.
It is best to apply data quality rules in an automated fashion. Simple rules can be defined, like operators in a programming language, corresponding to atomic operations such as greater than or less than. Those simple operators can be organized hierarchically into collections of operations that establish basic rules, such as social security number validation. Those rules can in turn be logically grouped into “rule sets”. For example, a rule set might contain all the data quality operations needed for a specific data feed, such as business transactions that require application validation and pre-processing, like loan processing or credit line increase requests. All the fields in that set of data can be analyzed automatically, creating clean records for downstream analytics. This type of data quality pre-processing is exactly what we as consumers of data should be demanding of our data ingest process.
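The layering described above (atomic operators, rules built from them, and rule sets per data feed) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product implements it: the names `Rule`, `LOAN_FEED_RULES`, `failed_rules`, and `split_clean`, the field names `ssn` and `amount`, and the amount limits are all hypothetical, and the social security number check validates the NNN-NN-NNNN shape only.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Atomic operators: the smallest building blocks, like comparison operators.
def greater_than(limit: float) -> Callable[[Any], bool]:
    return lambda v: v is not None and v > limit

def less_than(limit: float) -> Callable[[Any], bool]:
    return lambda v: v is not None and v < limit

def matches(pattern: str) -> Callable[[Any], bool]:
    rx = re.compile(pattern)
    return lambda v: isinstance(v, str) and rx.fullmatch(v) is not None

def all_of(*checks: Callable[[Any], bool]) -> Callable[[Any], bool]:
    # Combine several operators into one check that passes only if all pass.
    return lambda v: all(check(v) for check in checks)

@dataclass
class Rule:
    name: str                       # label reported when the rule fails
    field: str                      # record field the rule applies to
    check: Callable[[Any], bool]    # operator or combination of operators

# A rule set: every data quality rule needed for one (hypothetical) loan feed.
LOAN_FEED_RULES: List[Rule] = [
    # Format-only check (assumption): shape NNN-NN-NNNN, nothing more.
    Rule("ssn_format", "ssn", matches(r"\d{3}-\d{2}-\d{4}")),
    # Illustrative limits (assumption): amount strictly between 0 and 1,000,000.
    Rule("amount_in_range", "amount", all_of(greater_than(0), less_than(1_000_000))),
]

def failed_rules(record: Dict[str, Any], rule_set: List[Rule]) -> List[str]:
    """Return the names of the rules this record violates, in rule-set order."""
    return [r.name for r in rule_set if not r.check(record.get(r.field))]

def split_clean(
    records: List[Dict[str, Any]], rule_set: List[Rule]
) -> Tuple[List[Dict[str, Any]], List[Tuple[Dict[str, Any], List[str]]]]:
    """Separate clean records from rejected ones, keeping the failure reasons."""
    clean, rejected = [], []
    for record in records:
        failures = failed_rules(record, rule_set)
        if failures:
            rejected.append((record, failures))
        else:
            clean.append(record)
    return clean, rejected
```

Keeping the rejected records together with the names of the rules they failed, rather than silently dropping them, gives the feed owner something concrete to act on.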
Knowing where data came from and how it was transformed, not only by data quality rules but by all of the transformations required by the business rules specific to the use case at hand, is invaluable. Being able to retrace the steps in data ingest and processing is critical to many data users because of regulatory requirements. Data in both banking and pharmaceutical research, for example, must be traceable under the regulations of the respective industries. At a lower, data engineering level, tracking data lineage is also a very valuable debugging tool. This covers both business process debugging (i.e., is our business process correct?) and data science debugging (i.e., are advanced algorithms functioning correctly?). With Bedrock, all of these concerns can be addressed by maintaining proper data lineage for datasets over time.
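To make the idea of retracing steps concrete, here is a minimal Python sketch of a dataset that records one lineage entry per transformation. It does not represent Bedrock's API or data model; `TrackedDataset`, `LineageStep`, `fingerprint`, and the step names are hypothetical, and the content hash is simply one plausible way to show what went into and came out of each step.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]

def fingerprint(rows: Rows) -> str:
    """Stable content hash of the rows, used to identify a dataset version."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

@dataclass(frozen=True)
class LineageStep:
    step: str          # name of the transformation that ran
    input_hash: str    # fingerprint of the data before the step
    output_hash: str   # fingerprint of the data after the step
    rows_in: int
    rows_out: int
    at: str            # UTC timestamp in ISO 8601 format

@dataclass
class TrackedDataset:
    source: str                                   # where the data was ingested from
    rows: Rows
    lineage: List[LineageStep] = field(default_factory=list)

    def apply(self, step: str, transform: Callable[[Rows], Rows]) -> "TrackedDataset":
        """Run a transformation and return a new dataset with one more lineage entry."""
        new_rows = transform(self.rows)
        entry = LineageStep(
            step=step,
            input_hash=fingerprint(self.rows),
            output_hash=fingerprint(new_rows),
            rows_in=len(self.rows),
            rows_out=len(new_rows),
            at=datetime.now(timezone.utc).isoformat(),
        )
        # The original dataset is left untouched so earlier versions stay inspectable.
        return TrackedDataset(self.source, new_rows, self.lineage + [entry])

# Usage with a hypothetical source name and two illustrative steps.
raw = TrackedDataset("loans_feed", [{"amount": 10}, {"amount": -5}])
cleaned = raw.apply("drop_negative_amounts", lambda rows: [r for r in rows if r["amount"] >= 0])
scaled = cleaned.apply("to_cents", lambda rows: [{"amount": r["amount"] * 100} for r in rows])
```

Reading `scaled.lineage` from first to last entry replays the path from ingest to final form. When a result looks wrong, the row counts and hashes show which step changed the data.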
A functioning data catalog seems like such a simple concept in this day and age. The fact is that many organizations still face the challenge of simply finding data. Having data is not the problem. Everyone has data now. What many are missing is an effective method for finding datasets in what can be a sea of ingested data. Part of the issue is simple organization: having an ingestion process that cleans, tags, and lands data in a known location while partitioning it along the way. Even then, finding datasets from which to start developing new analytics can be challenging without an effective way to search them rapidly. In this case, search doesn't mean SQL queries on a dataset but rather a search of metadata such that the right dataset can be found. Some groups rely on a single person or group of people to be the human curator of datasets for an organization. Some groups rely on a system of codes embedded in directory structures and file names in an attempt to make data discoverable. Having an easily searchable data catalog is the key to eventually maturing an organizational data culture toward self-service operations.
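The difference between querying a dataset and searching its metadata can be shown with a small in-memory sketch. Everything here is an assumption made for illustration: the `CatalogEntry` fields, the `DataCatalog` class, the paths, and the tags are hypothetical, and a real catalog would add persistence, ranking, and access control.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Set

def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric search terms."""
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}

@dataclass
class CatalogEntry:
    name: str                                     # dataset name
    location: str                                 # where ingestion landed the data
    description: str = ""
    tags: Set[str] = field(default_factory=set)   # tags applied during ingest

class DataCatalog:
    """Searches dataset metadata only; the data itself is never read."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}
        self._index: Dict[str, Set[str]] = {}     # search term -> dataset names

    def register(self, entry: CatalogEntry) -> None:
        # Simplification: re-registering a name does not purge its old index terms.
        self._entries[entry.name] = entry
        text = " ".join([entry.name, entry.description, *sorted(entry.tags)])
        for term in tokenize(text):
            self._index.setdefault(term, set()).add(entry.name)

    def search(self, query: str) -> List[CatalogEntry]:
        """Return entries whose metadata contains every term in the query."""
        terms = tokenize(query)
        if not terms:
            return []
        hits = set.intersection(*(self._index.get(t, set()) for t in terms))
        return [self._entries[name] for name in sorted(hits)]

# Usage with hypothetical datasets and landing paths.
catalog = DataCatalog()
catalog.register(CatalogEntry(
    "loan_applications", "/lake/clean/loans/",
    "Validated loan application records", {"finance", "pii"},
))
catalog.register(CatalogEntry(
    "sensor_readings", "/lake/clean/iot/",
    "Device telemetry from field sensors", {"iot"},
))
results = catalog.search("loan pii")
```

Because the index is built from names, descriptions, and tags, its usefulness depends directly on the ingestion process applying that metadata consistently.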
Self-Service Data Presentation and Transformation
In terms of operational maturity, the goal of many organizations is to provide IT services in a self-service mode. That is, make the presentation of that service “easy” while still offering enough technical depth to be useful to the end user. A properly groomed data catalog allows independent metadata investigation by analysts and data scientists who are experts in advanced analytics. They are also more closely aligned with the business process requirements than the data engineer, who typically has more of an IT process focus. The analytical consumer of the data often has insight into improvements to the data transformation process that, in the long term, should rightfully be pushed back into the automated transformation that happens during data ingestion, ahead of the self-service layer. A self-service portal like Mica enables such a feedback loop by giving access to the right users, who can iterate over the data to discover additional transforms that become permanent process improvements. In the other direction, a self-service portal gives a variety of business data consumers faster access to clean data for discovering new analytics and improving existing ones.
Automation of data processing is the perfect use of a computer's time. Manual intervention is not only unnecessary now; it is actually a critical business problem. As data begins to move faster and technology trends like the Internet of Things deliver data to us more quickly, automation will be the key. Putting automated policies in place to capture, clean, and tag data, and then feeding that data into a system in which datasets are discoverable for applying and creating analytics, will be important to all organizations. Regardless of data format, data retention time, or delivery method, automated, self-enabling data technologies will provide the operational flexibility to empower tomorrow's advanced analytics.
For more information about optimizing data lake management and governance through automated processes, view the Data Lake 360° Solution.
Published at DZone with permission of Adam Diaz, DZone MVB. See the original article here.