So far in this series of articles, I’ve focused on how embedded and misplaced field and record delimiters can create data-quality problems when data are loaded into a data lake. I’ll now explore another category of data-quality challenge: what happens when the contents of individual fields and records don’t match the expected and required data.
Dirty data are a universal problem and a fact of life for virtually every company. Despite decades of efforts to segregate and/or fix bad data as they are migrated to an organization or between systems, it remains the case that most companies have enormous amounts of dirty data built into their backbone enterprise applications and databases. Enterprise data lakes have the potential not only to perpetuate this problem but to make it worse by simplifying and accelerating business users’ ability to access dirty data and base potentially bad decisions on them.
Garbage In, Garbage Out
In a traditional data lake, data from a source system are loaded into Hadoop without any technical metadata driving the process. In fact, the extreme flexibility of “schema on read” is considered one of the benefits of Hadoop. To create a common definition of these data we must next create a schema for the file, which is recorded in HCatalog. However, there is no verification or validation concerning whether the data actually match the schema definition. Generally speaking, the overwhelming majority of data in enterprises come from structured or semi-structured sources, where some description of the file and record layout is available, whether from databases, header files, or copybooks. A source definition can be used to validate data as they are ingested, so dirty data from source systems can be identified and potentially fixed before they inadvertently pollute the lake.
With Podium, we identify and flag data-quality problems at the point of ingest by building a robust set of data validation rules for each data source as it is loaded into the data lake. Validation rules compare the contents of each field in each record with a set of parameters and thresholds to determine whether that record contains quality data. Here is a sample of the dozens of validation values Podium uses to check each field:
|Podium validation field||Test applied to that field|
|Null?||Does this field contain a null value?|
|Value = Null Proxy?||Does this field contain a null proxy value?|
|Control Characters||Does this field contain control characters?|
|Non-ASCII||Does this field contain non-ASCII characters?|
|Value>MAX||Does the value in this field exceed an established maximum?|
|Value<MIN||Does the value in this field meet an established minimum?|
|Data type?||Does the value match its type (e.g., date, integer, Boolean, string)|
Some of these validation values are obvious and predefined, such as the test that determines whether a field contains control characters or non-ASCII text. All others are set as user-defined parameters through Podium’s interactive graphical user interface (GUI).
As part of the data-ingest process, we look at every value in every field in every record in a massive, parallel load process; check each field for data quality problems; and then partition the records into three groups: the good, the bad, and the ugly.
- Good records pass all quality tests, meaning that all fields in that record match or comply with expected values based on the validation rules. For example, a record with the appropriate number of fields, proper data types, matching delimiters and terminators, and conforming character sets would be considered
- Bad records have a corrupt record structure that often surfaces as not enough or too many fields. This is often caused by record or field delimiters embedded in a string or control, or other hidden characters that Hadoop can’t process.
- Ugly records are configurable but include data that don’t match a field-data type. Examples include invalid date formats; invalid numeric formats; data containing invalid, e.g., unprintable, characters; and data that don’t match a user-specified pattern or regular expression.
As Podium partitions the data into three groups, all data are retained so that they can be subsequently analyzed and, if needed, remediated. The error log shows which fields were problematic and why each record was assigned to the good, bad, or ugly bucket. By reviewing ugly records and understanding the patterns underlying why groups of records failed, users can often adjust validation parameters, rerun the load process, and accurately migrate records into the data lake as good records. Truly bad records can be remediated and reloaded as needed, or simply saved to document the load-process outcome.
By building a gate to identify and fix data-quality issues in source data during the data-lake load process, organizations can give business users better data to drive smarter analysis and business decision-making. By automating and embedding the data-quality checkpoint into an integrated data-lake management platform, Podium helps companies deploy their data lake faster without compromising data quality. Without this validation, users may spend hours, even days writing programs to validate their data and remove dirty records.
In our next and final post in this data-quality blog series, we’ll talk about a final group of particularly pernicious data-quality issues specific to mainframe and legacy data sources.