How to Ensure Big Data is Quality Data

The rise of mobile and social technologies in recent years has led to a deluge of data, much of which has been commandeered by researchers for their studies.

Whilst this data has obvious benefits, these benefits will only materialize if the quality of data can be guaranteed.

A team from the University of Twente have set out to provide an easy way of evaluating the quality of data generated by the crowd.

Validating Crowd-based Data Quality

Nowhere has this deluge been greater than in geographic information submitted by volunteers. When data is low-quality and inconsistent, it can bias analysis, because the data does not accurately reflect the variable being studied. It’s important, therefore, to be able to quickly and accurately sift out inconsistent observations from the pack to ensure the data remains scientifically viable.

The team's paper describes a novel automated workflow to do just that.

“Leveraging a digital control mechanism means we can give value to the millions of observations collected by volunteers,” the authors say, and “it allows a new kind of science where citizens can directly contribute to the analysis of global challenges like climate change.”

The process utilizes contextual information to judge the quality of the information, and has been built using a mixture of dimensionality reduction, clustering and outlier detection techniques.

The process was put through its paces on a project around the flowering of lilac plants in North America that relied heavily on volunteer data.
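To make the three stages concrete, here is a minimal sketch of how dimensionality reduction, clustering and outlier detection might be chained together. It is not the authors' implementation: PCA, k-means and a median-based outlier score are stand-ins chosen for brevity, the site data is invented for illustration, the function names are hypothetical, and NumPy is assumed to be installed.

```python
import numpy as np

def pca_scores(features, n_components=1):
    """Standardize the columns, then project onto the leading principal components."""
    # Assumes no column is constant (otherwise the standard deviation is zero).
    z = (features - features.mean(axis=0)) / features.std(axis=0)
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[:n_components].T

def kmeans_labels(points, k=2, n_iter=50):
    """Plain k-means with a deterministic start spread along the first axis."""
    order = np.argsort(points[:, 0])
    start = order[np.linspace(0, len(points) - 1, k).astype(int)]
    centroids = points[start].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels

def flag_inconsistent(values, labels, threshold=3.5):
    """Flag values far from their cluster median, scaled by the cluster's MAD."""
    flags = np.zeros(len(values), dtype=bool)
    for j in np.unique(labels):
        members = labels == j
        v = values[members]
        median = np.median(v)
        mad = 1.4826 * np.median(np.abs(v - median))
        if mad > 0:  # skip clusters with no spread at all
            flags[members] = np.abs(v - median) / mad > threshold
    return flags

# Invented context per observation: latitude (degrees), mean spring temperature (C).
context = np.array([
    [35.0, 15.0], [35.5, 14.8], [36.0, 14.5], [36.5, 14.2], [37.0, 14.0], [35.2, 14.9],
    [45.0, 7.0], [45.5, 6.8], [46.0, 6.5], [46.5, 6.2], [47.0, 6.0],
])
# Invented reported bloom dates (day of year); the sixth report is suspiciously late.
bloom_day = np.array([100, 102, 101, 103, 104, 180, 140, 141, 143, 142, 144], dtype=float)

labels = kmeans_labels(pca_scores(context), k=2)
flags = flag_inconsistent(bloom_day, labels)
print(np.flatnonzero(flags).tolist())
```

The point of clustering on context first is that a bloom on day 140 is unremarkable at a cold site but would be suspicious at a warm one; here only the day-180 report from a warm site is flagged.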

Whilst it’s inevitable that some unusual observations are valid, the authors showed that these outliers can still cause a bias in the trends. They suggest, therefore, that identifying inconsistent observations is crucial to the accurate study of the topic and needs to be done from the outset.
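How much can a single outlier matter? The sketch below fits an ordinary least-squares trend to a made-up ten-year series of bloom dates, with and without one very late report in the final year. The numbers are hypothetical and are not taken from the lilac study; the function name `ols_slope` is likewise just for illustration.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

years = list(range(2000, 2010))
# Hypothetical series: blooming arrives half a day earlier each year.
clean_days = [120 - 0.5 * (year - 2000) for year in years]
# Same series, but the final year's report is replaced by a very late one.
reported_days = clean_days[:-1] + [150.0]

clean_slope = ols_slope(years, clean_days)
biased_slope = ols_slope(years, reported_days)
print(f"clean trend: {clean_slope:+.2f} days/year")
print(f"with one late report: {biased_slope:+.2f} days/year")
```

In this toy series, one report is enough to move the fitted trend from -0.50 to about +1.38 days per year, turning an apparent shift towards earlier blooming into one towards later blooming.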

Published at DZone with permission of
