
Making It Easier to Clean Big Data


The hope is that these kinds of automated tools will make it easier for a wider spread of organizations to start utilizing data.


Big Data is great, but it’s only really useful if you can derive insights from it. With much of the data we harvest being somewhat messy and unstructured, organizations often spend far more time tidying up the data than they do gaining insights from it.

I’ve written before about automated approaches to doing this, with projects such as ActiveClean from researchers at Columbia University and the University of California, Berkeley. ActiveClean uses prediction models to test datasets, then uses the results to identify the fields that require cleaning, updating the models as it goes.

“Big Data sets are still mostly combined and edited manually, aided by data cleaning software like Google Refine and Trifacta or custom scripts developed for specific data cleaning tasks,” the researchers say. “The process consumes up to 80 percent of analysts’ time as they hunt for dirty data, clean it, retrain their model and repeat the process. Cleaning is largely done by guesswork.”

Error Spotting

Whilst sorting out what is largely inconsequential data takes up a major part of an analyst’s time, there is also the sizeable task of cleaning up erroneous data that can skew datasets.

A new tool called Vizier, developed by researchers at the University at Buffalo, aims to help by proactively catching data errors. The tool allows users to work interactively with datasets, cleaning, curating, and visualizing data in what the team hopes are meaningful ways.

The tool is intended for very large datasets with millions of data points.

“We are creating a tool that’ll let you work with the data you have, and also unobtrusively make helpful observations like ‘Hmm… have you noticed that two out of a million records make a 10 percent difference in this average?'” the team says.
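The kind of observation the team describes can be illustrated with a small sketch. This is not Vizier's actual method, which the article does not describe. It is a minimal example that assumes a plain list of numbers and uses a hypothetical helper, `influential_records`, to measure how much the average shifts when the few most extreme records are set aside.

```python
import heapq


def influential_records(values, top_k=2):
    """Measure how much the top_k most extreme records shift the mean.

    Hypothetical helper for illustration only; not part of Vizier.
    Returns (mean_all, mean_without, relative_change, suspect_indices).
    """
    n = len(values)
    if n <= top_k:
        raise ValueError("need more records than top_k")
    total = sum(values)
    mean_all = total / n
    # Pick the top_k records that lie furthest from the overall mean.
    suspects = heapq.nlargest(
        top_k, range(n), key=lambda i: abs(values[i] - mean_all)
    )
    removed = sum(values[i] for i in suspects)
    mean_without = (total - removed) / (n - top_k)
    # Relative shift in the average caused by the suspect records.
    relative_change = (mean_all - mean_without) / mean_without
    return mean_all, mean_without, relative_change, sorted(suspects)


# Made-up data: a million records, two of which are wildly out of range.
values = [100] * 999_998 + [5_000_100] * 2
mean_all, mean_without, change, suspects = influential_records(values)
print(mean_all, mean_without, f"{change:.0%}")  # prints: 110.0 100.0 10%
```

Here two records out of a million move the average by 10 percent, which is the sort of thing an analyst would want flagged before trusting the figure. A real tool would need a more robust notion of "extreme" than distance from the mean, since the mean is itself distorted by the outliers.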

The hope is that these kinds of automated tools will make it easier for a wider spread of organizations to start utilizing data, as they won’t have to invest huge sums in building the kind of teams required to clean up the data they’re collecting.

This is especially so as a growing number of government agencies are releasing data, and open data is rapidly becoming the de facto way of operating for governments across the Western world.

As an open source tool, Vizier hopes to play an active and positive role in this ecosystem.

“We want to make it easier for data scientists — and eventually data hobbyists — to discover and communicate not only what the data says, but why the data says that,” the team says.


Topics:
big data, data cleaning, automation

Published at DZone with permission of Adi Gaskell, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
