
A Collaborative Approach to Data Science


Learn about the feature engineering tool that reduces the time data scientists spend on defining prediction problems to days rather than months.


Organizations have a rapidly expanding amount of data to work with, but deriving insight from that data is a challenge, not least because of persistent skills shortages in the industry. Data analysis typically begins with the identification of so-called "features," which are data points believed to have predictive power. Identifying these features usually requires a degree of experience.

A team from MIT has developed a new tool called FeatureHub that they believe will make feature identification easier. The tool, which is documented in a recently published paper, takes a collaborative approach to the task, with data scientists working together to review a problem and propose features that are then tested by the software against target data to gauge their usefulness.
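The propose-then-test loop described above can be illustrated with a minimal sketch. This is not FeatureHub's actual API or scoring method, neither of which is described here. It assumes each proposed feature is a plain function over a record and that "usefulness" is approximated by the absolute Pearson correlation with the target. The record fields, feature names, and the rank_features helper are all hypothetical.

```python
from math import sqrt
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a record is a dict of numeric fields,
# and a proposed feature is a function from a record to a number.
Record = Dict[str, float]
Feature = Callable[[Record], float]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features(
    records: List[Record],
    target: List[float],
    proposals: Dict[str, Feature],
) -> List[Tuple[str, float]]:
    """Score every proposed feature against the target, best first.

    Assumption: absolute correlation stands in for whatever
    evaluation the real tool performs.
    """
    scores = {
        name: abs(pearson([feature(r) for r in records], target))
        for name, feature in proposals.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up example data: four trips and the value we want to predict.
records = [
    {"distance_km": 1.0, "duration_min": 5.0},
    {"distance_km": 2.0, "duration_min": 12.0},
    {"distance_km": 3.0, "duration_min": 10.0},
    {"distance_km": 4.0, "duration_min": 20.0},
]
target = [2.0, 4.0, 6.0, 8.0]

# Features "proposed" by different collaborators.
proposals = {
    "distance": lambda r: r["distance_km"],
    "speed": lambda r: r["distance_km"] / r["duration_min"],
    "constant": lambda r: 1.0,
}

ranking = rank_features(records, target, proposals)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setup, each collaborator only has to contribute a small function, and a shared scoring step makes the proposals directly comparable. A real system would more likely score features by the performance of a model trained on them rather than by a single correlation.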

When the software was tested, a team of 32 data scientists spent five hours each with it while tackling two data science challenges. The suggestions produced with the system were compared with those submitted by the community at Kaggle, with each suggestion rated on a 100-point scale. Interestingly, in both challenges the suggestions produced with the software came within three to five points of the winning entries on Kaggle.

Efficient Solutions

Where the software comes into its own is in the timeliness of its suggestions. Whereas a high-performing entry on Kaggle would usually take at least a few weeks of work, FeatureHub returned a suggestion within days.

The team hopes that the platform will eventually reach a scale similar to that of its namesake and inspiration, GitHub, the huge hosting platform and repository for open-source programming projects.

"I do hope that we can facilitate having thousands of people working on a single solution for predicting where traffic accidents are most likely to strike in New York City or predicting which patients in a hospital are most likely to require some medical intervention," one of the researchers says. "I think that the concept of massive and open data science can be really leveraged for areas where there's a strong social impact but not necessarily a single profit-making or government organization that is coordinating responses."

The project builds upon previous work by the team, which I documented last year. That work was chronicled in a couple of papers covering the preparation of data and even the creation of problem specifications.

"The goal of all this is to present the interesting stuff to the data scientists so that they can more quickly address all these new data sets that are coming in," the authors say. "[Data scientists want to know], 'Why don't you show me the top 10 things that I can do the best, and then I'll dig down into those?' So [these methods are] shrinking the time between getting a data set and actually producing value out of it."

The researchers, who are bringing their tool to market through their company, Feature Labs, developed a new programming language called Trane that is intended to cut the time data scientists spend defining prediction problems from months to days. The team is confident that similar improvements can be made for label-segment featurize (LSF) processes.

Suffice it to say, the project is at an early stage, but given the challenges inherent in making sense of data, it's an interesting area to follow.


data analysis, data science, big data, predictive analytics, feature identification, open data, kaggle, collaboration

Published at DZone with permission of
