The Secret to Getting Data Lake Insight: Data Quality

For companies to get the most out of their digital transformation projects and build an agile data lake, they need to design data quality processes from the start.


More and more companies around the globe are realizing that big data and deeper analytics can help improve their revenue and profitability. As a result, they are building data lakes with new big data technologies and tools so they can answer questions such as: How do we increase production while maintaining costs? How do we improve customer intimacy and share of wallet? What new business opportunities should we pursue? Big data is playing a major role in digital transformation projects; however, companies that do not have trusted data at the heart of their operations will not realize the full benefits of their efforts.

Instituting Sustainable Data Quality and Governance Measures

If big data is to be used for decision-making, organizations need to make sure that the data they collect is kept under control and held to a high standard. Yet, according to a recent report by KPMG, 56% of CEOs are concerned about the quality of the data on which they base decisions. To improve the trustworthiness of data as it flows through the enterprise, companies need to look at the entire data quality lifecycle, including metadata management, lineage, preparation, cleansing, profiling, stewardship, privacy, and security.

A few weeks ago, Gartner released the 2017 Gartner Magic Quadrant for Data Quality Tools, a report that reviews the data quality lifecycle and showcases innovative technologies designed to "meet the needs of end-user organizations in the next 12 to 18 months."

The report highlights the increasing importance of data quality for the success of digital transformation projects, the need to use data quality as a means to reduce costs, and the changing requirements to be a leader. Some of the trends highlighted in the report that speak directly to data lake development and usage include:

  • The need to capture and reconcile metadata.
  • The ability to connect to a wide variety of on-premises and cloud structured and unstructured data sources.
  • The importance of DevOps and integration interoperability in the data quality environment.
  • How business users are now the primary audience and need data quality workflow and issue resolution tools.
  • The increasing requirement for real-time data quality services for low-latency applications.

Machine Learning and Natural Language Processing to the Rescue

As companies ingest large amounts of unstructured and unknown data, it can be a challenge to validate, cleanse, and transform the data quickly enough to avoid delaying real-time decisions and analytics. This does not mean that 100% of the data lake needs to be sanctioned data; companies often create a "raw data" partition of the data lake, which data scientists often prefer for analysis. In addition, raw and diverse data can be provisioned to different roles before enrichment, shifting from a single-version-of-the-truth model to a more open and trustworthy collaborative governance model.

In the past, data quality relied solely on complex algorithms, such as probabilistic matching for deduplicating and reconciling records. An important trend that we are seeing at Talend, and that the Gartner report outlines, is the use of machine learning with data quality to assist with matching, linking, and merging data. With the sheer volume and variety of data in the data lake, using Hadoop, Spark, and machine learning for data quality processing means faster time to trusted insight. Data science algorithms can quickly sift through gigabytes of data to identify relationships among records, duplicates, and more. Natural language processing can help reconcile definitions and provide structure to unstructured text, providing additional insight when combined with structured data.
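To make the idea of matching-based deduplication concrete, here is a minimal sketch in Python. It is not Talend's method or a true probabilistic matcher: it swaps in a plain string-similarity threshold from the standard library's `difflib` as a stand-in. The `deduplicate` helper, the `customers` sample records, and the 0.85 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def deduplicate(records: list[dict], key: str, threshold: float = 0.85) -> list[dict]:
    """Keep the first record of each group of near-duplicates on `key`.

    Hypothetical helper: compares every record against the records kept so
    far, which is quadratic and only suitable for small samples.
    """
    kept: list[dict] = []
    for record in records:
        if not any(similarity(record[key], k[key]) >= threshold for k in kept):
            kept.append(record)
    return kept


# Made-up sample data: three spellings of one company and one distinct company.
customers = [
    {"id": 1, "name": "Acme Corporation"},
    {"id": 2, "name": "ACME  Corporation"},
    {"id": 3, "name": "Acme Corporaton"},
    {"id": 4, "name": "Globex Inc"},
]

unique = deduplicate(customers, key="name")
print([r["id"] for r in unique])
```

A fixed threshold like this is exactly the kind of hand-tuned rule that a learned matcher could replace: a model trained on pairs labeled by data stewards would produce the match decision instead of the hard-coded cutoff.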

Machine learning can be a game changer because it can capture tacit knowledge from the people who know the data best and turn this knowledge into algorithms that automate data processing at scale. Furthermore, with smarter software that uses machine learning and smart semantics, any line-of-business user can become a data curator, making data quality a team sport! For example, tools such as Talend Data Preparation and Talend Data Stewardship combine a highly interactive, visual, and guided user experience with these features to make the data curation process easier and the data cleansing process faster.

Devising a Plan for Agile Data Quality in the Data Lake

Implementing a data quality program for big data can be overwhelming. It is important to come up with an incremental plan and set realistic goals; sometimes getting to 95% is good enough.

  1. Roles: Identify roles, including data stewards and users of data.

  2. Discovery: Understand where data is coming from, where it is going, and what shape it is in. Focus on cleaning your most valuable and most used data first.

  3. Standardization: Validate, cleanse, and transform data. Add metadata early so that data can be found by humans and machines. Identify and protect personal and private organizational data with data masking.

  4. Reconciliation: Verify that data was migrated correctly.

  5. Self-service: Make data quality agile by letting the people who know the data best clean their own data.

  6. Automate: Identify where machine learning in the data quality process can help, such as data deduplication.

  7. Monitor and manage: Get continuous feedback from users, and define data quality metrics so that improvement can be measured.
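Several of the steps above can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the record layout, the validity rule, and the helper names (`standardize`, `is_valid`, `mask_email`, `reconcile`) are hypothetical, and the unsalted hash used for masking is a placeholder rather than a recommended privacy control.

```python
import hashlib


def standardize(record: dict) -> dict:
    """Step 3: trim whitespace and normalize case (assumed record layout)."""
    return {
        "id": record["id"],
        "email": (record.get("email") or "").strip().lower(),
        "country": (record.get("country") or "").strip().upper(),
    }


def is_valid(record: dict) -> bool:
    """Deliberately simple rule: one '@' in the email, two-letter country."""
    return record["email"].count("@") == 1 and len(record["country"]) == 2


def mask_email(email: str) -> str:
    """Step 3: replace the local part with a short SHA-256 digest, keep the domain."""
    local, _, domain = email.partition("@")
    digest = hashlib.sha256(local.encode("utf-8")).hexdigest()[:8]
    return f"{digest}@{domain}"


def reconcile(source_ids, target_ids) -> set:
    """Step 4: return ids present in the source but missing from the target."""
    return set(source_ids) - set(target_ids)


# Made-up raw records with typical defects: padding, bad email, missing country.
raw = [
    {"id": 1, "email": "  Ann@Example.com ", "country": "us"},
    {"id": 2, "email": "bob-at-example.com", "country": "FR"},
    {"id": 3, "email": "cy@example.org", "country": None},
    {"id": 4, "email": "dee@example.net", "country": "de "},
]

clean = [standardize(r) for r in raw]
valid = [r for r in clean if is_valid(r)]
masked = [{**r, "email": mask_email(r["email"])} for r in valid]

# Step 7: one possible metric, the share of records that pass validation.
quality_score = len(valid) / len(clean)
missing = reconcile((r["id"] for r in raw), (r["id"] for r in masked))

print(f"quality score: {quality_score:.0%}, ids not loaded: {sorted(missing)}")
```

Tracking a score like `quality_score` per data source over time is one way to decide when a dataset is "good enough" and where stewards should focus next.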

In summary, for companies to get the most out of their digital transformation projects and build an agile data lake, they need to design data quality processes from the start.

*Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.