
How to Ensure Big Data is Quality Data


A team from the University of Twente have set out to provide an easy way of evaluating the quality of data generated by the crowd.


The rise of mobile and social technologies in recent years has led to a deluge of data, much of which has been commandeered by researchers for their studies.

Whilst this data has obvious benefits, they will only materialize if the quality of the data can be assured.


Validating Crowd-based Data Quality

Nowhere has this deluge been greater than in geographic information submitted by volunteers. When data is low quality and inconsistent, it can bias analysis, because the data no longer accurately reflects the variable being studied. It’s important, therefore, to be able to quickly and accurately sift out inconsistent observations to ensure the data remains scientifically viable.

The paper describes a novel automated workflow to do just that.

“Leveraging a digital control mechanism means we can give value to the millions of observations collected by volunteers,” the authors say, adding that “it allows a new kind of science where citizens can directly contribute to the analysis of global challenges like climate change.”

The process utilizes contextual information to judge the quality of the information, and has been built using a mixture of dimensionality reduction, clustering and outlier detection techniques.
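To make the three ingredients concrete, here is a minimal sketch of how they could be chained together. It is not the authors' workflow: the contextual variables (latitude and elevation), the bloom days, the choice of two clusters, and the 3.5 threshold are all made-up assumptions for illustration, and the specific techniques (a first principal component, 1-D k-means, and a median-based outlier score) are simply stand-ins for the three families of methods named above.

```python
import statistics

def standardize(rows):
    """Scale each context column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

def first_component(rows, iters=200):
    """Dimensionality reduction: project each row onto the first principal
    component, estimated by power iteration on X^T X."""
    dim = len(rows[0])
    w = [1.0] * dim
    for _ in range(iters):
        scores = [sum(a * b for a, b in zip(r, w)) for r in rows]
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(dim)]
        norm = sum(x * x for x in w) ** 0.5 or 1.0
        w = [x / norm for x in w]
    return [sum(a * b for a, b in zip(r, w)) for r in rows]

def kmeans_1d(values, k=2, iters=50):
    """Clustering: plain 1-D k-means, centres seeded evenly over the range."""
    lo, hi = min(values), max(values)
    centres = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centres[c] = statistics.fmean(members)
    return labels

def flag_outliers(observed, labels, threshold=3.5):
    """Outlier detection: within each cluster, flag values whose modified
    z-score (based on the median absolute deviation) exceeds the threshold."""
    flags = [False] * len(observed)
    for c in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        med = statistics.median([observed[i] for i in idx])
        mad = statistics.median([abs(observed[i] - med) for i in idx])
        if mad == 0:
            continue  # no spread in this cluster, nothing to compare against
        for i in idx:
            flags[i] = abs(0.6745 * (observed[i] - med) / mad) > threshold
    return flags

# Hypothetical context: (latitude, elevation in metres) for each observation.
context = [
    (35.0, 200), (35.5, 250), (36.0, 220), (35.2, 210), (36.1, 240), (35.8, 230),
    (45.0, 900), (45.5, 950), (46.0, 920), (45.2, 910), (46.1, 940), (45.8, 930),
]
# Hypothetical reported flowering day of year; the sixth value is suspicious.
bloom_day = [100, 102, 101, 99, 103, 180, 140, 142, 141, 139, 143, 141]

scores = first_component(standardize(context))
labels = kmeans_1d(scores, k=2)
flags = flag_outliers(bloom_day, labels)
print([i for i, flagged in enumerate(flags) if flagged])
```

On this toy data, only the sixth observation (a day-180 bloom reported at a site whose neighbours bloom around day 100) is flagged. The point of the design is that an observation is judged against others made in a similar context rather than against the whole dataset, so a late bloom at a northern, high-elevation site is not penalized simply for being late.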

The process was put through its paces on a project around the flowering of lilac plants in North America that relied heavily on volunteer data.

Whilst it’s inevitable that some unusual observations are valid, the authors showed that these outliers can still cause a bias in the trends. They suggest, therefore, that identifying inconsistent observations is crucial to the accurate study of the topic and needs to be done from the outset.
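A small worked example shows how little it takes for an outlier to distort a trend. The numbers below are invented for illustration and are not taken from the lilac study: ten years of flowering dates that advance by half a day per year, plus one unusually late report in the final year.

```python
import statistics

def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical series: flowering gets 0.5 days earlier each year.
years = list(range(2000, 2010))
days = [120 - 0.5 * (y - 2000) for y in years]

slope_clean = slope(years, days)

# Add a single unusually late observation (day 150) in 2009.
slope_biased = slope(years + [2009], days + [150.0])

print(f"trend without the outlier: {slope_clean:+.2f} days/year")
print(f"trend with the outlier:    {slope_biased:+.2f} days/year")
```

In this constructed case, a single extra observation is enough to turn an advancing trend into an apparently delaying one, which is why screening before trend analysis matters even when the unusual report might be genuine.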


Topics:
big data, data analytics, mobile data

