Common sense tells us that you can’t use data unless you understand its quality. Data quality checks are critical for the data lake, but it’s not unusual for companies to gloss over this process in the rush to move data into less costly and more scalable Hadoop storage, especially during initial adoption. After all, isn't landing data in Hadoop with little definition of schema and data quality what Hadoop is all about? Once data has landed in a raw zone in Hadoop, reality quickly sets in: for the data to be useful, both structure and data quality must be applied. Defining data quality rules becomes particularly important depending on what sort of data you’re bringing into the data lake (for example, large volumes of data from machines and sensors). Validation is essential because such data comes from an external environment and probably hasn’t gone through any quality checks.
Existing Hadoop users whose data lake already holds data that never went through a standard data quality process needn’t worry. There are a number of best practices for validating data, whether you’re still planning a Hadoop implementation or already have a data lake. No matter what stage of data maturity you are in, you can run your data quality checks on Hadoop itself, taking advantage of its natural parallelism and its cost benefits.
The Concept of Data Quality in Hadoop
First, what do we mean by data quality? Data quality in Hadoop is not the same as data quality in a traditional data warehouse, where partial records are often rejected. One of the benefits of Hadoop is that you can keep all of your raw data in its native format and use or transform the parts of data sets that pass a quality threshold for a particular use case. For example, a data set may not have complete address information but is still useful because it contains the zip codes needed for an analysis.
A useful way to think about it is that data quality in Hadoop isn’t always about cleansing data to fit a particular schema; instead, it’s about evaluating the data to know what you have and then determining later whether it is useful for a particular use case. This becomes especially obvious with unstructured or semi-structured data, such as binary data, where data quality can take on a variety of meanings.
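To make the idea of a per-use-case quality threshold concrete, here is a minimal Python sketch. It profiles how complete each field is and then asks whether the data set is good enough for a given purpose. The function names, the sample records, and the 95% threshold are all illustrative assumptions, not part of any particular product.

```python
from collections import Counter

def profile_completeness(records, fields):
    """Return the fraction of records with a non-empty value for each field."""
    total = len(records)
    filled = Counter()
    for record in records:
        for field in fields:
            value = record.get(field)
            if value is not None and str(value).strip() != "":
                filled[field] += 1
    return {field: (filled[field] / total if total else 0.0) for field in fields}

def usable_for(profile, required_fields, threshold):
    """True if every field the use case needs meets the completeness threshold."""
    return all(profile.get(field, 0.0) >= threshold for field in required_fields)

# Hypothetical raw records: street and city are patchy, zip codes are complete.
records = [
    {"street": "12 Main St", "city": "Durham", "zip": "27701"},
    {"street": "", "city": None, "zip": "27513"},
    {"street": None, "city": "Raleigh", "zip": "27601"},
    {"zip": "27560"},
]

profile = profile_completeness(records, ["street", "city", "zip"])
zip_analysis_ok = usable_for(profile, ["zip"], threshold=0.95)
mailing_list_ok = usable_for(profile, ["street", "city", "zip"], threshold=0.95)
print(profile, zip_analysis_ok, mailing_list_ok)
```

The raw records are never modified or rejected; the same data set qualifies for the zip-code analysis and fails for a use case that needs full addresses.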
Maximize Efficiency: Check Data Quality Upon Ingestion
To evaluate data quality at the scale of big data and reduce errors, automation is the key to success. Using a data management platform to validate data automatically during ingest moves data from its raw form into a more consumable format, both for production use cases and for discovery activities by data scientists. Automation matters not just for storing data at scale but for making the data useful to the business as fast as possible, because it draws on Hadoop's natural ability to do work in parallel to shorten time to value.
Give Your Data Warehouse a Break
Running data quality actions in Hadoop as part of an ETL/ingestion process also lets you move this work out of the traditional data warehouse to a less expensive, more scalable platform. Running Hadoop in-house has been the traditional answer. Increasingly, we see the use of cloud services and a data lake management platform like Bedrock to orchestrate data preparation activities across physical, virtual, and hybrid cloud environments. This also includes the use of transient clusters like Amazon EMR, with data stored in S3 as persistent storage.
Use a Zone Defense
Zaloni pairs the use of a data lake management platform with a recommended strategy for zones in the data lake: specifically, landing, raw, trusted, refined, and sandbox zones, with rules related to data quality, security, and privacy (e.g., masking and tokenization) applied as part of the automated movement of data between the zones. The zone your data is in indicates the degree of confidence, the level of access, or the appropriate use of your data.
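As a rough illustration of rule-driven movement between zones, the Python sketch below promotes records from a raw zone to a trusted zone only if they pass a quality check, and tokenizes a sensitive field on the way. This is not how Bedrock or any specific platform implements it; the field names, the quality rule, and the hash-based tokenization are assumptions made for the example.

```python
import hashlib

def tokenize(value, salt="demo-salt"):
    """Replace a sensitive value with a deterministic token (illustrative only)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

def passes_quality(record):
    """Hypothetical quality rule: sensor id is present and the reading is numeric."""
    return bool(record.get("sensor_id")) and isinstance(record.get("reading"), (int, float))

def promote_raw_to_trusted(raw_records):
    """Split raw records into those promoted to trusted and those held back in raw."""
    trusted, held_back = [], []
    for record in raw_records:
        if passes_quality(record):
            promoted = dict(record)  # copy, so the raw zone keeps the original
            if promoted.get("owner_email"):
                promoted["owner_email"] = tokenize(promoted["owner_email"])
            trusted.append(promoted)
        else:
            held_back.append(record)
    return trusted, held_back

# Hypothetical contents of the raw zone.
raw_zone = [
    {"sensor_id": "s-1", "reading": 21.5, "owner_email": "a@example.com"},
    {"sensor_id": "", "reading": 19.0, "owner_email": "b@example.com"},
    {"sensor_id": "s-3", "reading": "n/a"},
]

trusted_zone, held_back = promote_raw_to_trusted(raw_zone)
print(len(trusted_zone), "promoted;", len(held_back), "held back in raw")
```

Records that fail the check stay in the raw zone unchanged, so they remain available for other use cases or for later repair.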
Standardize Data Validation
Data quality processes are based on setting functions, rules, and rule sets that standardize the validation of data across data sets. Here’s a simplified overview: functions are the most basic (e.g., a number is greater than another number) and can be combined to create rules (e.g., data can’t be null and must be greater than 10). Then, rules can be combined to create rule sets (e.g., check all fields and make sure there’s a valid email address). You then determine which validation processes and hierarchy of rules apply to which data or data sets.
For example, a simple function (e.g., Is this number greater than zero?) may be adequate for some data, while other data may need to be validated by a more complex hierarchy of rules. Often, the level of required validation is influenced by legacy restrictions or internal processes that are already in place, so it’s a good idea to evaluate your company’s existing processes before setting your rules. The most important tip? Automate and standardize your data quality check process as soon as possible.
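The hierarchy of functions, rules, and rule sets can be sketched in a few lines of Python. This is one possible way to model it, not a description of any platform's API; the helper names, the field names, and the deliberately loose email pattern are assumptions for illustration.

```python
import re

# Functions: the most basic checks.
def not_null(value):
    return value is not None

def greater_than(limit):
    def check(value):
        return isinstance(value, (int, float)) and value > limit
    return check

def matches(pattern):
    compiled = re.compile(pattern)
    def check(value):
        return isinstance(value, str) and compiled.fullmatch(value) is not None
    return check

# Rules: several functions combined; all of them must pass.
def rule(*functions):
    def check(value):
        return all(fn(value) for fn in functions)
    return check

# Rule sets: one rule per field, applied to a whole record.
def apply_rule_set(rule_set, record):
    """Return the names of the fields that fail their rule."""
    return [field for field, check in rule_set.items() if not check(record.get(field))]

# Hypothetical rule set: quantity must be non-null and greater than 10,
# email must be non-null and look roughly like an email address.
order_rules = {
    "quantity": rule(not_null, greater_than(10)),
    "email": rule(not_null, matches(r"[^@\s]+@[^@\s]+\.[^@\s]+")),
}

good = apply_rule_set(order_rules, {"quantity": 12, "email": "a@example.com"})
bad = apply_rule_set(order_rules, {"quantity": 5, "email": "not-an-email"})
print(good, bad)
```

Because a rule set is just data, the same set can be reused across data sets, which is what makes validation standardized and easy to automate.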