Why Data Quality Should Be the Red Thread of your Data Strategy [Interview]
How can companies make accurate decisions based on poor quality data? They can't. Remember the saying: Garbage in, garbage out.
I believe this year's Gartner Magic Quadrant for Data Quality Tools report represents further proof of the fundamental shift in what enterprises need to support their data quality initiatives. With the continued growth in volume, variety, and velocity of data collected and managed by data lakes not showing any sign of stopping, it would make sense that the requirements for data quality tasks such as defining relevancy, recency, and range increase at a similar pace.
Open source is one factor that is likely more causal than correlative in the proverbial "changing of the guard" taking place in how companies approach data integration in the era of big data. Over the last decade, there has been increasing acceptance of open-source technologies as formidable enterprise solutions, enabling frameworks like Apache Spark to replace their proprietary and now antiquated counterparts. As a result of this change, we see new customer requirements emerging that demand interoperability with their framework of choice, which gives them the flexibility to adapt to ever-evolving market needs. This makes one wonder: Are vendors whose solutions restrict or exclude interoperability with Spark out of touch with business and customer demands, not only today but in the future? Perhaps proprietary vendors still believe they know best.
We believe this year's Gartner Magic Quadrant for Data Quality Tools confirms that market dynamics are changing in a direction Talend forecast some time ago. The market is shifting to cloud and big data and customers need flexible platforms that can keep pace with rapidly evolving technologies that help them manage those new frontiers. As I see it, the only way to do this is to be open source-based. Talend has always been open source-based, but what many may not know is that data quality has also always been part of our Data Integration DNA, which is why it is at the core of our Talend Data Fabric platform. As the saying goes, "garbage in, garbage out." How can companies make accurate decisions based on poor quality data? We believe Talend's move from a Visionary to a Leader in this year's Gartner Magic Quadrant for Data Quality Tools is due to our completeness of vision and ability to execute, further validating that Talend is moving in the right direction — addressing otherwise unmet customer needs to be more data-driven.
Now, I imagine the publication of this MQ will prompt blogs, announcements, and articles opining on the merits of various Data Quality products or approaches. For my part, I'd like to highlight an interview I had recently with one of our community members, Michael Covert, CEO of Analytics Inside. In our discussion, he spoke about his company's use of Talend to solve a Data Quality and Governance initiative for a Healthcare customer.
Nick: When starting a data governance initiative, what's one of the first things organizations should do?
MC: One of the first things we advise our customers to do is undergo a data review and cleansing task. It is important to gain a quick understanding of just what you are dealing with... get a sense of how "dirty" the data is, whether date formats are invalid, whether the data requires preprocessing to remove punctuation, to capitalize, etc. In this particular project, the customer had a variety of data sources, both structured and unstructured, from which they needed to extract legal entity information. This consisted of company names, addresses, phone numbers, employer identification numbers (EINs), and other pieces of information that could be placed into a corporate-wide master file.
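As a rough illustration of the review and cleansing step described above, the following Python sketch checks dates against a few accepted layouts and normalizes company names by removing punctuation and capitalizing. It is not taken from the project discussed in the interview: the `registered` field name, the list of date formats, and all function names are assumptions made for the example.

```python
import string
from datetime import datetime

# Hypothetical list of accepted date layouts; a real review would
# derive these from profiling the actual sources.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(value):
    """Return an ISO date string, or None if no accepted layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_name(value):
    """Remove punctuation, collapse whitespace, and upper-case a name."""
    no_punct = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split()).upper()


def profile(records):
    """Count records whose 'registered' date fails every accepted layout."""
    invalid = sum(1 for r in records if parse_date(r["registered"]) is None)
    return {"total": len(records), "invalid_dates": invalid}
```

Running `profile` over a sample of each source gives a quick sense of how "dirty" the dates are before any repair rules are written.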
Nick: That's not an easy task, given the variety of both file type and format. How did you solve that particular challenge?
MC: The key is to identify the "named entities" in the free-form text. Our expertise in the area has allowed us to build a solution (RelExtract) that uses OpenNLP for natural language processing and is easily embedded into the process flow using Talend for further matching and deduplication. It should be noted that there's no way to completely automate this process. It can be difficult, if not impossible, to define automated rules that handle every variant that could be used; sometimes human interaction is simply required.
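To make the idea of pulling identifiers out of free-form text concrete, here is a deliberately simple Python sketch. It uses regular expressions rather than OpenNLP or RelExtract, so it only illustrates the rule-based end of the problem and says nothing about how the solution in the interview works. The pattern names and the sample sentence are assumptions; the EIN pattern relies on the standard two-digit, hyphen, seven-digit layout, and the phone pattern covers only common US formats.

```python
import re

# EINs are written as two digits, a hyphen, and seven digits.
EIN_PATTERN = re.compile(r"\b\d{2}-\d{7}\b")

# Common US phone layouts only: (555) 123-4567 or 555-123-4567.
PHONE_PATTERN = re.compile(
    r"(?:\(\d{3}\)\s*|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b"
)


def extract_identifiers(text):
    """Return every EIN-like and phone-like string found in the text."""
    return {
        "eins": EIN_PATTERN.findall(text),
        "phones": PHONE_PATTERN.findall(text),
    }


sample = (
    "Acme Corp (EIN 12-3456789) can be reached at "
    "(555) 123-4567 or 555-987-6543."
)
found = extract_identifiers(sample)
```

Rules like these break down quickly on company names and addresses, which is where a trained named-entity model and, as noted above, human review come in.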
Nick: How does an organization overcome the knowledge hurdle here when often, IT doesn't have the background on the data to make the call?
MC: You're right. The need for business participation occurs in both the preprocessing and exception handling steps. That's why finding a collaborative solution becomes extremely critical to the project if you want to make any headway, and also why Talend is our go-to platform. Talend not only provides a complete data analysis framework, but also supplies an intuitive set of web-based interfaces that allow business users to participate, and in fact, own the data governance initiative. In this particular example, we had the business work with Talend's Data Preparation tool allowing the business users to examine ad hoc data sources and devise "recipes" that could be played against the data set to cleanse the data as much as is possible. When the business context was established, we worked with the data engineers so that they could build their integration pipelines to ingest a variety of records, reference the recipes for preprocessing and compare the cleansed records to several "gold record" database tables. Most of the time, matches were found and the variant types repaired, and for the exceptions, we wrote them to the Talend Data Stewardship Console (TDSC) where, later, a business user could examine and correct (or discard) the records.
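The flow described above (compare cleansed records to "gold records", repair the matches, and route exceptions to a steward) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Talend's matching logic or the project's pipeline: it uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, and the function names, the 0.85 threshold, and the sample data are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def match_record(name, gold_names, threshold=0.85):
    """Return the closest gold name if it clears the threshold, else None."""
    best = max(gold_names, key=lambda g: similarity(name, g))
    return best if similarity(name, best) >= threshold else None


def route(records, gold_names, threshold=0.85):
    """Split records into repaired matches and exceptions for human review."""
    matched, exceptions = [], []
    for name in records:
        gold = match_record(name, gold_names, threshold)
        if gold is None:
            exceptions.append(name)  # left for a business user to resolve
        else:
            matched.append((name, gold))
    return matched, exceptions


gold = ["ACME INC", "GLOBEX CORPORATION"]
incoming = ["Acme Inc", "Globex Corporatoin", "Initech LLC"]
matched, exceptions = route(incoming, gold)
```

In this sample the misspelled "Globex Corporatoin" is close enough to be repaired automatically, while "Initech LLC" has no gold counterpart and lands in the exception list. The threshold is the design lever: raising it sends more records to the steward, lowering it risks false matches.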
Nick: You mentioned a number of benefits to our integrated platform. Were there any others?
MC: It can't be overstated how flexible the Talend platform is. Talend made it really easy for my team to develop a set of reusable components that we standardized our approach on for all of the subsequent efforts. This reduced the development effort drastically (less development and less QA). On top of that, the original plan was to deploy in a standard data integration environment, but as the project grew we ended up productionizing into a big data environment. There was no recoding required to do so. In fact, some developers didn't even know that their code had been migrated!
Nick: What was the impact on the customer?
MC: Overall, the combination of our expertise and Talend's platform reduced the business load that was being placed on the IT organization, which, in many cases, was already overworked. We were able to get the groups talking and ultimately aligned on responsibilities and deliverables, which led each unit to take greater ownership of what it was charged with doing. The result was an overwhelming success, and the solutions devised are now core components of their data architecture.
**Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.**
Published at DZone with permission of Nick Piette, DZone MVB.
Opinions expressed by DZone contributors are their own.