How to Minimize Data Wrangling and Maximize Data Intelligence

What are the most time-consuming tasks that data scientists face, and what kind of tools exist to remove those roadblocks?

It's not unusual for data analysts to spend more than half their time cleaning and converting data rather than extracting business intelligence from it. As data stores grow in size and data types proliferate, a new generation of tools is arriving that promises to put sophisticated analysis into the hands of people who are not data scientists.

One of the hottest job titles in technology is Data Scientist, perhaps surpassed only by the newest C-level position: Chief Data Scientist. IT's long-standing skepticism about such trends is evident in the joke cited by InfoWorld's Yves de Montcheuil that a data scientist is a business analyst who lives in California.

There's nothing funny about every company's need to translate its data into business intelligence. That's where data scientists take the lead role, but as the volume of data grows and data types proliferate, data scientists find themselves spending the bulk of their time cleaning and converting data rather than analyzing it and communicating the results to business managers.

A recent survey of data scientists (registration required) conducted by IT-project crowdsourcing firm CrowdFlower found that two out of three analysts say cleaning and organizing data is their most time-consuming task, and 52 percent report that their biggest obstacle is poor-quality data. While the respondents named 48 different technologies they use in their work, the most popular is Excel (55.6 percent), followed by the open source language R (43.1 percent) and the Tableau data-visualization software (26.1 percent).

Data scientists identify their greatest challenges as time spent cleaning data, poor data quality, lack of time for analysis, and ineffective data modeling. Source: CrowdFlower

What's holding data analysis back? The data scientists surveyed cite a lack of tools required to do their job effectively (54.3 percent), failure of their organizations to state goals and objectives clearly (52.3 percent), and insufficient investment in training (47.7 percent).

A dearth of tools, unclear goals, and too little training are reported as the principal impediments to data scientists' effectiveness. Source: CrowdFlower

New Tools Promise to 'Consumerize' Big Data Analysis

It's a common theme in technology: In the early days of a new technology, only an elite few possess the knowledge and tools required to understand and use it, but over time the products improve and drop in price, businesses adapt, and the technology goes mainstream. New data-analysis tools are arriving that promise to deliver the benefits of big data analysis to non-scientists.

Steve Lohr profiles several of these products in an August 17, 2014, article in the New York Times. For example, ClearStory Data's software combines data from multiple sources and converts it into charts, maps, and other graphics. Taking a different approach to the data-preparation problem is Paxata, which offers software that retrieves, cleans, and blends data for analysis by various visualization tools.
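To make the "clean and blend" step concrete, here is a minimal Python sketch of the kind of work such tools aim to automate. It is not how Paxata or ClearStory Data work internally; the sample data, column names, and function names are all hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical sample inputs: two small sources with inconsistent formatting.
CRM_CSV = """customer_id,name,country
 101 ,Alice Smith,us
102,bob jones ,US
102,bob jones ,US
103,,DE
"""

ORDERS_CSV = """customer_id,total
101,250.00
102,n/a
103,75.50
"""


def clean_customers(text):
    """Trim whitespace, normalize case, and drop duplicate or nameless rows."""
    cleaned = {}
    for row in csv.DictReader(io.StringIO(text)):
        customer_id = row["customer_id"].strip()
        name = row["name"].strip().title()
        if not name:
            continue  # skip rows that are missing a required field
        # Keying by ID keeps a single row per customer (the last one seen).
        cleaned[customer_id] = {
            "name": name,
            "country": row["country"].strip().upper(),
        }
    return cleaned


def clean_orders(text):
    """Parse order totals, skipping values that are not valid numbers."""
    totals = {}
    for row in csv.DictReader(io.StringIO(text)):
        try:
            totals[row["customer_id"].strip()] = float(row["total"])
        except ValueError:
            continue  # e.g. "n/a"
    return totals


def blend(customers, totals):
    """Join the two cleaned sources on customer_id (inner join)."""
    return [
        {"customer_id": cid, **info, "total": totals[cid]}
        for cid, info in sorted(customers.items())
        if cid in totals
    ]


result = blend(clean_customers(CRM_CSV), clean_orders(ORDERS_CSV))
print(result)
```

Even in this toy case, most of the code handles stray whitespace, inconsistent capitalization, duplicates, and bad values rather than analysis, which mirrors the survey respondents' complaint.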

The not-for-profit Open Knowledge Labs bills itself as a community of "civic hackers, data wranglers and ordinary citizens intrigued and excited by the possibilities of combining technology and information for good." The group is seeking volunteer "data curators" to maintain core data sets such as GDP figures and ISO codes. OKL's Rufus Pollock describes the project in a January 3, 2015, post.

Open Knowledge Labs is seeking volunteer coders to curate core data sets as part of the Frictionless Data Project. Source: Open Knowledge Labs

There's no simpler or more straightforward way to manage your heterogeneous MySQL, MongoDB, Redis, and ElasticSearch databases than by using Morpheus. Morpheus lets you provision, monitor, and analyze SQL, NoSQL, and in-memory databases across hybrid clouds via a single point-and-click dashboard. Each database instance you create includes a free full replica set for built-in fault tolerance and failover.
