This is the final installment in our blog series on data quality during the ingest process of setting up a data lake.
So far we’ve talked about data quality problems connected to file formatting issues such as embedded or misplaced field and record delimiters. We have looked at the potential havoc those can cause if not addressed as data is added into the data lake. We’ve also looked at another set of data quality issues that have to do with the actual content of each field, and which can render records inaccurate or unusable if not resolved during the ingest process.
In this post, I want to address the topics of data conversion and encryption. While not data quality issues per se, these are data usability issues, which are especially important when it comes to bringing mainframe or XML sources into the data lake.
Mainframe and legacy applications generally organize and store data in a way that presents very particular data conversion issues when those files are ingested into the data lake. Here are some of the mainframe and legacy data specific conversion measures that need to be taken during the ingest process:
- Mainframe data, stored in EBCDIC or Packed-Decimal format, needs to be converted into a UTF-8 standard compliant format.
- The COBOL copybook or equivalent for each mainframe data file needs to be converted into an accurate and complete HCat schema with full field names and data types.
- Fields with OCCURS or REDEFINES clauses, which are really hierarchical records, need to be reformatted either by flattening them into a denormalized record format or by normalizing them into separate tables with keys to join them.
- Multiple record types within a single file must be identified and ingested accurately for their specific format.
- Headers and trailers for each file often contain validation data (record count, table schema, business data) that needs to be verified during the load, then stripped from the data set for querying and analysis.
Similarly XML files present some data conversion issues that also must be addressed on ingest into the data lake, especially if you want to use relational operations against them.
- A valid HCat schema must be created for each XML file, based on the XSD associated with that file, including valid field names and data types.
- Hierarchical structures in XML files have to be normalized using either of the methods listed above.
Trying to address these mainframe and XML specific data conversion issues natively in Hadoop is a lot of work and requires deep expertise in Hadoop programing. However, if it is not done, there is no way to query the data in its native format.
Encryption on Ingest
Another data ingest issue that crops up frequently with mainframe data is the need to protect sensitive data in the data lake through encryption. To truly secure the data lake, encryption needs to happen as part of the ingest process so that sensitive data arrives on the Hadoop cluster already protected. Ideally, that encryption will be at the field level so that you can protect field specific sensitive data without having to copy entire data sets from one place to another in HDFS. Finally, using encrypted data for analytics requires that the data is encrypted consistently. This means that a given value is always encrypted to the same string, so users can analyze data without decrypting it. For example, consistently encrypted data allows tables to use these values to join tables, filter records, and group records for analysis.
Non-Standard Representations of Numbers and Dates
The last thing I want to address is the need to correctly convert non-standard representations of numbers and dates as they are ingested into the data lake. Some applications store numbers and dates in non-standard ways. Excel for example will take very large numbers and put them in exponent notation while some legacy systems have leading plus and minus signs. If ingested into Hadoop without remediation, these non-standard numbers are generally incorrectly treated at strings in Hadoop. To accurately capture this data in the data lake, the ingest process needs to recognize these fields for what they are and to convert them into standard numeric values. Likewise custom data formats need to be mapped into standard data formats so you can do GROUP BYs and JOINs on date fields and filter by date immediately without having to come up with date conversion algorithms.
A Closing Comment
Why Fixing Data Quality on Ingest Into the Lake Matters
In my experience at Podium Data, working with dozens of large companies on data lake projects, the task of ingesting mainframe and legacy data into the data lake nearly always emerges as one of the most difficult and time consuming aspect of the overall effort.
The files themselves are large, messy, and opaque, and the development effort requires deep expertise in COBOL as well as advanced Hadoop skills. We regularly meet with clients who are 2+ years into a data lake effort, and who have sunk thousands of hours of time into trying to bring mainframe data into the data lake, with very little to show for their efforts. This is particularly frustrating because this tedious work—standardizing values, reformatting records, and handling dirty data—is not adding real business value. This series of articles paints a picture of why ingesting mainframe and legacy data into the data lake is such a challenge.
A central part of what Podium Data delivers with our data lake management platform is built-in best practice expertise on solving this part of the data lake challenge. We’ve taken everything we have learned about how to solve data quality issues as part of the ingest process, especially for mainframe and legacy data, and built an automated process into our product.