One of the often-cited benefits of using a Hadoop data lake to deliver data to the business is that you don’t have to do much preparation before you start working with the data. Just load it into the data lake and dig in.
If only it were that easy. When data is copied from its original location into Hadoop, it is not checked for data errors, incorrect formatting, and other idiosyncrasies that are very common in mainframe and legacy data sources. Left undetected and unaddressed, these problems put bad data in the data lake. In this series of articles, I’ll explain in detail how we designed Podium to resolve this problem.
Unless you first ensure the validity of the data being loaded into the data lake, you will get a case of “indigestion”: Hadoop will happily take any data you give it, but it gives no indication that anything may be wrong with that data. Not only will this cause errors, but you will often get wrong results without even knowing it.
How is that possible? Many mainframe, legacy, and dirty data sources have quality issues that must be identified and fixed before the data is loaded and made available to users via HCatalog. If those problems aren’t fixed on ingest, the data in the data lake will be incorrect in ways that are hard to detect and hard to fix. Common quality issues include:
- Embedded delimiters, where a value contains the character that separates fields, such as a comma
- Corrupted records where an operational system may have inadvertently put control characters in values
- Data type mismatches, such as alphabetic characters in a numeric field
- Non-standard representations of numbers and dates
- Headers and trailers that contain control and quality information and must be processed differently from regular data records
- Multiple record types in a single file
- Mainframe file formats that use different character sets on legacy systems, which Hadoop does not know how to recognize or process
These are seven examples we often see that create dirty data in the data lake if you don’t identify and fix data issues during the ingest process. Our enterprise data management platform, Podium, handles all of these data quality issues in a massively parallel, single-pass process as it ingests data into the Hadoop cluster, which allows us to produce ready-to-query tables that are registered in HCatalog the minute they land. This process breaks out problematic records but still loads them and makes them available for end users to remediate.
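To make the idea of breaking out problematic records concrete, here is a minimal Python sketch of a single-pass check that separates clean rows from bad ones while keeping both. It is an illustration only, not Podium's implementation: the four-field layout (id, name, amount, date), the `check_record` and `ingest` names, and the date format are all assumptions made for the example, and it covers only a few of the issues listed above.

```python
from datetime import datetime

# Hypothetical layout for this example: id, name, amount, date
EXPECTED_FIELDS = 4
DATE_FORMAT = "%Y-%m-%d"  # assumed standard date representation


def check_record(line, delimiter=","):
    """Return a list of problems found in one raw line (no line terminator)."""
    problems = []
    # Control characters other than tab suggest a corrupted record.
    if any(ord(ch) < 32 and ch != "\t" for ch in line):
        problems.append("control character")
    fields = line.split(delimiter)
    # An embedded delimiter shows up as an unexpected field count.
    if len(fields) != EXPECTED_FIELDS:
        problems.append("field count")
        return problems  # positions are unreliable, so stop here
    _id, _name, amount, date = fields
    try:
        float(amount)  # data type check on the numeric field
    except ValueError:
        problems.append("non-numeric amount")
    try:
        datetime.strptime(date, DATE_FORMAT)  # date representation check
    except ValueError:
        problems.append("bad date")
    return problems


def ingest(lines, delimiter=","):
    """Single pass: clean rows go to `good`, the rest to `bad` with reasons."""
    good, bad = [], []
    for line in lines:
        problems = check_record(line, delimiter)
        if problems:
            # Problem records are kept, not dropped, so they can be remediated.
            bad.append((line, problems))
        else:
            good.append(line.split(delimiter))
    return good, bad


sample = [
    "1,Acme,19.99,2015-03-01",       # clean
    "2,Acme, Inc.,5.00,2015-03-02",  # embedded delimiter
    "3,Beta,12x.50,2015-03-03",      # letters in a numeric field
    "4,Gamma,7.25,03/04/2015",       # non-standard date
    "5,Del\x07ta,3.10,2015-03-05",   # control character in a value
]
good, bad = ingest(sample)
for line, problems in bad:
    print(repr(line), problems)
```

With this sample, one row lands in `good` and four land in `bad`, each tagged with the reason it was set aside. Keeping the raw line next to its list of problems is what lets someone fix the record later instead of silently losing it.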
In my next post, I’ll provide a simple, real-life example of dirty data and show how to deal with it so that you don’t get indigestion.