Dirty Data – Example #1: Embedded Delimiters
Looking at flawed data through Hive and HDFS and what can be done about it.
Join the DZone community and get the full member experience.Join For Free
Let’s look at a simple example of how embedded delimiters cause problems in Hadoop. Shown here is a file of consumer complaints that comes from the Consumer Financial Protection Bureau, which captures a record for every complaint registered with the bureau (www.consumerfinance.gov/data-research/consumer-complaints). In this case, we might want to analyze these complaints and try to understand some of the root cause. We will do a couple steps.
You can see it has multiple fields, which are delimited by commas. Also, there is a header record that has the names of those fields. The header records, or column names, contain spaces and dashes, for example “Date received” and “Sub-product”. There are data strings that include commas and values defined by quotes, for example in the last line: “Other (phone, health club, etc.)”
The first step in analyzing this data would be to put the file onto the Hadoop cluster so we can take advantage of the parallel processing in HDFS in our analysis. We would also start by registering the schema, which is a very simple schema with a dozen or so fields, with HCatalog so that we can run a query. In this case, let’s look at which companies are most complained about. To get that information we just need to run a simple query that counts the number of complaints for each company, like this Hive query:
select count(*) as total, company from complaints group by company order by total desc; Here is what the result looks like: 699 Equifax 589 Experian 551 TransUnion 288 10/06/2014 263 JPMorgan Chase 250 Citibank 247 10/01/2014 242 09/29/2014 240 09/22/2014 238 09/25/2014 218 10/03/2014 208 Bank of America 206 09/30/2014 206 09/24/2014 204 09/26/2014 201 Wells Fargo 194 10/02/2014 181 09/23/2014
The problem with this answer, of course, is that it’s wrong. Not only does it include the names of companies with a tally of how many complaints were filed against them, but also dates. The calculation erroneously identified some dates as companies and associated complaint records with those erroneous companies. The erroneous company “10/06/2014” for example is associated with 288 complaints. This creates another type of error. Specifically because some complaint events are being associated with the wrong company, the tallies for real companies are wrong. The 288 complaints associated with “10/06/2014” in actually belong to one or more actual companies, but we don’t know who. All of the tally figures become suspect.
The root cause for this is not that the data is corrupt, but that they are using a common convention of embedded delimiters inside double quotes. That type of convention trips up the naïve processing of Hadoop. There are steps that can be taken to improve your ability to manage this but this gets much worse when those embedded delimiters are actually new line characters or record delimiters, instead of just field delimiters. It is very hard without an expert Java and MapReduce programming techniques to pre-process this data in parallel and create a data set that is ready for consumption.
Podium Data’s enterprise data lake management platform includes built in functionality to solve problems like this because we recognize that in fact many enterprise data sources contain these types of data issues. Providing data ingestion capabilities that address these problems accurately and automatically is essential functionality to deploying an enterprise scale data lake.
Sometimes the ingestion challenge is that there is no description of the layout of the file and records. In this case, the Podium data management platform forensically discovers the structure by crawling through and reading the data file itself. Podium infers the structure of the file (headers, trailers, column names, record counts, character set), records (fixed/variable length, delimiters, multiple record types), and fields (delimiters, datatypes, and enclosures) prior to ingestion. Of course, if layout information such as DDL, XSD, or COBOL copybooks exist Podium can use these to directly set metadata properties used to control the ingestion process. All of these configuration properties are accessible through Podium’s GUI and can be reviewed and updated to enable accurate ingestion of that data set. Using those configuration properties, Podium reads incoming data, record by record, and identifies and addresses all potential data issues with embedded delimiters, control characters, invalid character sets, corrupted records, or hidden characters, which might otherwise lead to data problems as presented in the example above. Finally, Podium registers the schema in HCatalog automatically, so data is ready to be queried—correctly — the minute it is loaded.
The end result is that data files ingested into Podium are an accurate representation of the source data, perfectly copied into Hadoop without the unintended and worse yet perhaps undetected introduction of data errors or corrupted records.
Even better, because Podium is driving the ingest process using a automated enterprise scale data management application, and because Podium executes the full ingest process on the Hadoop cluster in parallel, Podium can ingest data sets – even very large data sets like multi terabyte mainframe databases – generally in less than an hour, often in just minutes. If you want a better sense for how this works, you might want to take a quick tour of the Podium product itself here.
In the next post, we’ll explore further why Hadoop alone struggles to address these types of dirty data problems upon ingest and how Podium overcomes these issues, under the covers.
Published at DZone with permission of Paul Barth. See the original article here.
Opinions expressed by DZone contributors are their own.