
Tools for Data Science: Using the Right Ones for the Job


Developers hear all the time how important it is to use the right tools, but this has never been truer than in the world of data science and machine learning.


You've probably heard the sayings many times: "You're only as good as your tools," and "Use the right tool for the right job." These have never been truer than in the world of data science and machine learning.

Part One: The Wrong Tools

When I was first learning data science and machine learning, I tried a sample data science project using a dataset from a recent medical study. For the study, patients came for appointments where investigators took their vital signs (weight, blood pressure, pulse, temperature, etc.) and administered some number of medical tests.

The tests could create a few values or many values, depending on which tests the investigators ran on the date of the patient's appointment, how long the patient had been part of the study, and how often they received tests. Some patients had only a month of data history, while other patients had years' worth of results. Investigators compiled all of the data from all of the tests into a single large data set.

The dataset was available as a CSV file in what is called an EAV format (Entity, Attribute, Value).

EAV is a common format when a full table representation would be very sparse: each row records one entity, one attribute, and that attribute's value, rather than one column per possible attribute. Unfortunately, many analytics tools, including the one I was using, require a tabular format in order to import the data, so making the conversion was my first task. You often hear that the data understanding and data preparation steps can be 80% of a data scientist's work. In this case, it was well over 90%.
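For readers who haven't wrestled with EAV data before, here is a minimal sketch of that kind of pivot in pandas. The file name and column names (patient_id, test_date, attribute, value) are hypothetical, not the study's actual layout:

```python
import pandas as pd

# Hypothetical file and column names; the study's actual CSV layout will differ.
eav = pd.read_csv("study_results.csv")  # columns: patient_id, test_date, attribute, value

# Pivot each (patient, visit date) pair into one row with a column per attribute.
wide = eav.pivot_table(
    index=["patient_id", "test_date"],
    columns="attribute",
    values="value",
    aggfunc="first",  # keep the first reading if an attribute repeats on a visit
).reset_index()

print(wide.head())
```

A one-line pivot like this only gets you a wide table; as the rest of this section shows, the real work starts once you see how sparse that table is.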

I started by writing code and Excel VBA macros to begin converting the EAV data into a tabular format. In the process, I learned that some of the feature columns had only a few values, so I deleted those. Since the frequency of the data was fairly random, I decided to keep the data gathered at each patient's first appointment, their 90-day appointment, and their 365-day appointment.

Again, with the randomness of the appointments, I had to pick test data within a +/- ten-day window around the 90- and 365-day appointments. I then dropped any patient without data for those three different points in time. Many more issues arose as I worked with the data and had to figure out what I wanted to do, but even after several weeks of work, the table was still sparse.
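To give a flavor of that windowing step, here is a rough sketch that continues the hypothetical wide table from above: it keeps each patient's visit closest to day 0, day 90, and day 365 (within a ten-day window) and drops patients missing any of the three. The column names are still illustrative:

```python
import pandas as pd

# Continues the hypothetical 'wide' table from the previous sketch.
wide["test_date"] = pd.to_datetime(wide["test_date"])

# Days since each patient's first appointment.
wide["study_day"] = (
    wide["test_date"] - wide.groupby("patient_id")["test_date"].transform("min")
).dt.days

def nearest_visit(group, target, window=10):
    """Return the visit closest to `target` study day, or None if none falls in the window."""
    offsets = (group["study_day"] - target).abs()
    if offsets.min() > window:
        return None
    return group.loc[offsets.idxmin()]

rows = []
for pid, group in wide.groupby("patient_id"):
    visits = {t: nearest_visit(group, t) for t in (0, 90, 365)}
    if any(v is None for v in visits.values()):
        continue  # drop patients without data at all three points in time
    row = {"patient_id": pid}
    for t, visit in visits.items():
        for col in group.columns:
            if col not in ("patient_id", "test_date", "study_day"):
                row[f"day{t}_{col}"] = visit[col]
    rows.append(row)

snapshot = pd.DataFrame(rows)
```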

To compensate, I did some work to fill in features. For example, if I needed a value for the 90-day appointment, I might average the values from a 70-day appointment and a 110-day appointment and use that as the 90-day value. Sometimes only nine out of ten values for a particular test were recorded, so I looked at that value in the previous appointment as well as the following one; if they matched, I used it to fill in the gap. I also performed some simple feature engineering, like subtracting the day-zero values from the day-365 values to see how they changed over the year. Some attributes were so sparse that I simply deleted them. Any column I couldn't get to roughly 80% complete wasn't worth keeping.
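Here is a sketch of that kind of gap-filling and feature engineering, again with illustrative column names and a deliberately simplified imputation rule (averaging the day-0 and day-365 readings) rather than the exact neighboring-appointment logic I used:

```python
import pandas as pd

# Continues the hypothetical 'snapshot' table from the previous sketch.
value_cols = [c for c in snapshot.columns if c != "patient_id"]
snapshot[value_cols] = snapshot[value_cols].apply(pd.to_numeric, errors="coerce")

# Simplified stand-in for neighboring-appointment averaging:
# if a day-90 reading is missing, average the day-0 and day-365 readings for that test.
for col in [c for c in value_cols if c.startswith("day90_")]:
    test = col[len("day90_"):]
    fallback = snapshot[[f"day0_{test}", f"day365_{test}"]].mean(axis=1)
    snapshot[col] = snapshot[col].fillna(fallback)

# Simple feature engineering: how each value changed over the year.
for col in [c for c in value_cols if c.startswith("day0_")]:
    test = col[len("day0_"):]
    snapshot[f"delta_{test}"] = snapshot[f"day365_{test}"] - snapshot[f"day0_{test}"]

# Drop any column that is less than roughly 80% populated.
snapshot = snapshot.dropna(axis=1, thresh=int(0.8 * len(snapshot)))
```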

None of this work will surprise a data scientist, but being new to this, I really didn't expect it to take so much of my time.

After quite a few weeks working with the data (code, Excel, and VBA), I finally had it converted to a suitable tabular format and I could start working with a tool to do my data exploration and analytics work.

Part Two: The Right Tools

Several months later, I started work at IBM's Silicon Valley Lab (SVL) and had the opportunity to interact with the senior data scientists at the SVL Machine Learning Hub. That meant I was able to be involved when a medical company came to the ML Hub for a free two-day workshop, offered to show how data science and machine learning could help with a clinical trial they were conducting. (Read here for more details on these free ML Hub workshops.)

The data — you guessed it — was in that same EAV format. But this time the right tools were easily available. Using IBM's Data Science Experience (DSX), the data scientists ingested the data with Jupyter notebooks written in Python and R, and began the work of data visualization and exploration within a few short hours, something that had taken me many weeks.

They simply opened DSX on the cloud, created a project, and invited everyone working on the problem to join it, so that when I logged in I could select the same project and see everything being worked on. Notebooks provisioned on top of Apache Spark™ were created in seconds, and the collaboration was underway in a matter of minutes.

From there, they converted, engineered, and analyzed data, and started testing different machine learning algorithms. That was just the first day. It was clear that DSX was a much better platform for this kind of work.

Now let's be honest: this work was being done by two senior data scientists, but the difference was immense. With the right tools, the work went dramatically faster. The time was spent working the problem, not working the data.

The right tools for the right job.


Topics: big data, data science, machine learning
