
How to Minimize Data Wrangling and Maximize Data Intelligence

What are the most time-consuming tasks that data scientists face, and what kind of tools exist to remove those roadblocks?

It's not unusual for data analysts to spend more than half their time cleaning and converting data rather than extracting business intelligence from it. As data stores grow in size and data types proliferate, a new generation of tools is arriving that promises to put sophisticated analysis into the hands of people who are not data scientists.

One of the hottest job titles in technology is Data Scientist, perhaps surpassed only by the newest C-level position: Chief Data Scientist. IT's long-standing skepticism about such trends is evident in the joke cited by InfoWorld's Yves de Montcheuil that a data scientist is a business analyst who lives in California.

There's nothing funny about every company's need to translate its data into business intelligence. That's where data scientists take the lead role, but as the amount and types of data proliferate, data scientists find themselves spending the bulk of their time cleaning and converting data rather than analyzing it and communicating the results to business managers.

A recent survey of data scientists (registration required) conducted by IT-project crowdsourcing firm CrowdFlower found that two out of three analysts claim cleaning and organizing data is their most time-consuming task, and 52 percent report their biggest obstacle is poor quality data. While the respondents named 48 different technologies they use in their work, the most popular is Excel (55.6 percent), followed by the open source language R (43.1 percent) and the Tableau data-visualization software (26.1 percent).

Data scientists identify their greatest challenges as time spent cleaning data, poor data quality, lack of time for analysis, and ineffective data modeling. Source: CrowdFlower

What's holding data analysis back? The data scientists surveyed cite a lack of tools required to do their job effectively (54.3 percent), failure of their organizations to state goals and objectives clearly (52.3 percent), and insufficient investment in training (47.7 percent).

A dearth of tools, unclear goals, and too little training are reported as the principal impediments to data scientists' effectiveness. Source: CrowdFlower

New Tools Promise to 'Consumerize' Big Data Analysis

It's a common theme in technology: In the early days, only an elite few possess the knowledge and tools required to understand and use it, but over time the products improve and drop in price, businesses adapt, and the technology goes mainstream. New data-analysis tools are arriving that promise to deliver the benefits of the technology to non-scientists.

Steve Lohr profiles several of these products in an August 17, 2014, article in the New York Times. For example, ClearStory Data's software combines data from multiple sources and converts it into charts, maps, and other graphics. Taking a different approach to the data-preparation problem is Paxata, which offers software that retrieves, cleans, and blends data for analysis by various visualization tools.
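To make "cleaning and blending" concrete, here is a minimal sketch in plain Python. It does not reflect how ClearStory Data or Paxata work internally; the sample records, field names, and the helper functions clean_sales and blend are all hypothetical and exist only to illustrate the kind of chores these products aim to automate.

```python
import csv
import io

# Hypothetical raw exports from two sources; every value is made up.
SALES_CSV = """country,revenue
 us ,"1,200"
DE,950
de,n/a
"""

REGIONS_CSV = """code,region
US,North America
DE,Europe
"""


def clean_sales(text):
    """Normalize country codes and drop rows whose revenue is not numeric."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        code = row["country"].strip().upper()
        raw = row["revenue"].replace(",", "").strip()
        try:
            revenue = float(raw)
        except ValueError:
            continue  # skip unparseable values such as "n/a"
        rows.append({"country": code, "revenue": revenue})
    return rows


def blend(sales, regions_text):
    """Attach a region to each sales row by joining on the country code."""
    reader = csv.DictReader(io.StringIO(regions_text))
    regions = {r["code"]: r["region"] for r in reader}
    return [dict(row, region=regions.get(row["country"], "Unknown")) for row in sales]


blended = blend(clean_sales(SALES_CSV), REGIONS_CSV)
for record in blended:
    print(record)
```

Even this toy case needs whitespace trimming, case normalization, thousands-separator removal, and a policy for bad values before the two sources can be joined, which is the sort of work survey respondents describe as their most time-consuming task.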

The not-for-profit Open Knowledge Labs bills itself as a community of "civic hackers, data wranglers and ordinary citizens intrigued and excited by the possibilities of combining technology and information for good." The group is seeking volunteer "data curators" to maintain core data sets such as GDP and ISO-codes. OKL's Rufus Pollock describes the project in a January 3, 2015, post.

Open Knowledge Labs is seeking volunteer coders to curate core data sets as part of the Frictionless Data Project. Source: Open Knowledge Labs

There's no simpler or more straightforward way to manage your heterogeneous MySQL, MongoDB, Redis, and ElasticSearch databases than by using Morpheus. Morpheus lets you provision, monitor, and analyze SQL, NoSQL, and in-memory databases across hybrid clouds via a single point-and-click dashboard. Each database instance you create includes a free full replica set for built-in fault tolerance and failover.

Topics: data, data analytics, database

Opinions expressed by DZone contributors are their own.
