
Big Data Tutorial: Running an End-to-End Data Science Workflow in Watson Studio Cloud


From cleaning to classification: everything you need for an efficient data science workflow.


Your smooth workflow after reading this article

A data scientist’s day-to-day work is so much more than just building machine learning models with 99% accuracy. In many ways, great data science is like great art. We know that making great art requires more than inspiration, as artists devote time to exploring, learning, and mastering their tools. Data science is no different.

Much like the artistic process, a data scientist follows the data science workflow in an effort to create their own original and compelling work. Watson Studio Cloud (WSC) supports data scientists at every step of this workflow, making it as valuable to the data scientist as paint is to the artist.


Almost every data scientist follows the same process:

Data processing, analysis, and classification

Let’s follow this workflow using the example of German Credit Risk Data and see how Watson Studio Cloud helps bring a data science project together from start to finish.

Part One: Connect and Access Data

After we create our account, we'll see the resources we've provisioned on the IBM Cloud landing page. First, we provision a Cloud Object Storage (COS) instance to associate with our account.

Next, go back to the IBM Watson Studio home page and choose to Create an empty project.

Creating an empty project in IBM Watson Studio

In this example, we'll create a project to contain all our assets (data, notebooks, models, etc.). We name and describe the project and associate it with the COS instance we've provisioned.

After we create the project, we can add data to it. When we click on the + Add to Project button, we’ll see something like this:

Choosing an asset type

We can either choose to add Data which lets us upload data directly from our device, or we can choose to add data from remote storage by selecting Connection. There are several available options for the task.

Creating a new connection to a database

For this example, we’ll use a dataset we upload locally.

But first, let’s check out a Db2 Warehouse connection setup as another example. Be sure to have the access credentials for the data connection that you’re trying to add.

New connection to the Db2 Warehouse

We can select Discover Data Assets to automatically add the datasets from the storage to our project of choice.

We can also add collaborators to our projects to integrate our work with that of our team members:

Adding team members to collaborate on the project

Part Two: Search and Find Relevant Data

Once we have our data assets in place, we can start cataloging the datasets that are important to us. For this, we’ll go back to the IBM Watson Studio home screen and select Create a Catalog.

There, we can choose to enforce certain data policies on our datasets. For this example, I’ve chosen to enforce all data policies. Later, we’ll see how these policies come into play.

Cloud object storage in IBM Watson Studio

Let’s check out my catalog by way of its landing page:

Example catalog

Note that all the assets in this catalog have tags, ratings, and associated reviews. As we build our catalog, these features become more and more useful, letting us quickly filter by highly rated assets or assets of a certain category and use them in projects. Under Access Control, we can add collaborators to the catalog.

Now that we’re inside this catalog, let’s explore some of the assets, starting with the structured dataset German Credit Data.csv.

German Credit Data.csv

The Overview tab gives a quick description of our dataset. The Access tab shows who has access to the data. The Review tab lets us rate and comment on the dataset; the Profile tab shows statistics for our dataset. The Profile tab looks something like this:

Profile tab example
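The statistics in the Profile tab are similar to what pandas produces locally; as a rough analogue, here is the same kind of profile computed on a few synthetic rows (the column names are assumptions based on the German Credit example, not the real file's schema):

```python
import io
import pandas as pd

# Synthetic stand-in rows; the real columns come from German Credit Data.csv.
csv_text = """Age,Sex,Credit amount,Duration
42,male,5951,48
29,female,1169,6
35,male,2096,12
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.describe())             # count/mean/std/quartiles per numeric column
print(df["Sex"].value_counts())  # frequency profile for a categorical column
```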

Finally, the Lineage tab shows the life cycle of a dataset:

Lineage tab example

Here, the first yellow "+" shows when we added the asset to the catalog. We can also see who added it and other metadata. The "pencil" markers describe asset updates, for example, when tags were created for the asset. The joined nodes tell when, where, and by whom the asset was added to a project.

Let’s now take a step back to look briefly at how these features function for unstructured data. I’ll open the file Pride and Prejudice.pdf. The Overview tab is very similar to the one for our credit data, but interesting things happen under the Profile tab. Watson Studio can run Watson Natural Language Understanding on the data and display particular information about it:

Natural Language Understanding in Watson Studio

Similarly, in addition to Categories, we can see Concepts, Sentiment, or Emotion.

I mentioned that we’d return to the notion of enforcing policies. Those policies come into play whenever we’re adding a dataset to our catalog. We’ll have the option of classifying it as a type of dataset governed by a certain policy.

Defining the type of dataset governed by a certain policy

When we classify a dataset as belonging to any of these categories, we’ll be able to see its classification under the Overview tab.

Now that we know we can add datasets from projects to catalogs, as well as from catalogs to projects, let's see what else we can do within a project.

Part Three: Prepare Data for Analysis

In this part of the data science workflow, Data Refinery becomes useful. Refinery is a self-service data preparation tool for data scientists and analysts. When we open a data asset within a project, at the top right corner we’ll see the option to Refine that data.

Refining data

Selecting Refine takes us to the Data Refinery service, where we can perform operations by first choosing a column and selecting the transformation we want to apply from the drop-down menu.

Selecting the desired column for data transformation

Then, we can apply more customized transformations by entering R code into the box that says Code an operation to cleanse and shape your data.

Entering R code to cleanse and reshape data

Once we have added all the transformation steps into our refinery, we can save and run the flow. In this example, Watson Studio saves the flow name as German Credit Data.csv_flow and the output of the flow (our refined data) under assets as German Credit Data.csv_shaped.csv.
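As a rough illustration of the kind of shaping step a Refinery flow records, here is an equivalent transformation written in pandas (the column names and the derived column are assumptions for illustration):

```python
import io
import pandas as pd

# A cleansing/shaping step similar to what Data Refinery records as a flow;
# the column names here are assumptions, not the real file's schema.
csv_text = """Credit amount,Duration
5951,48
1169,6
2096,12
"""
df = pd.read_csv(io.StringIO(csv_text))

shaped = (
    df.rename(columns={"Credit amount": "credit_amount"})  # normalize a name
      .assign(monthly_payment=lambda d: d["credit_amount"] / d["Duration"])
)
print(shaped)
```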

Another important element of data exploration is creating a visual representation of the data. For this, we can create various Analytics Dashboards to quickly analyze our data. Within our project, we can choose to add a dashboard, give our new dashboard a name, and save it. Watson Studio then redirects us to a window where we can choose a template for our dashboard based on the number of graphs we might want to add.

Once we select what we want, we see the following window:

Choosing a template for the dashboard

We can click "+" next to Selected sources to add desired datasets to our dashboard. This example uses German Credit Data.csv. We can then select the kind of charts we want to add to our dashboard.

Below, I add a bar chart for my first box. There, I can drag, for example, the Sex column in front of both the Bars and Length elements to get the count of males and females in my dataset.

Getting count of individuals by sex

This visualization then becomes my top left element in the dashboard below. Carrying on, I can add other visualizations and create the following credit data dashboard.

Adding visualizations to the credit data dashboard

Now, we can start building a deeper analysis of our dataset. For example, we can see that men at age 42 take high loans in terms of total sum. So, we can filter out that subset and look for insights from other graphs that change dynamically.

Filtering out the subset and looking for insights
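The same aggregate-then-filter drill-down the dashboard performs can be checked in a few lines of pandas (toy rows standing in for the real dataset; values are made up):

```python
import pandas as pd

# Toy rows standing in for German Credit Data.csv (values are made up).
df = pd.DataFrame({
    "Sex": ["male", "male", "female", "male"],
    "Age": [42, 42, 30, 25],
    "Credit amount": [6000, 4000, 1500, 2000],
})

# Total loan sum by sex and age, as the dashboard chart aggregates it...
totals = df.groupby(["Sex", "Age"])["Credit amount"].sum()
print(totals)

# ...then the drill-down: the 42-year-old male subset.
subset = df[(df["Sex"] == "male") & (df["Age"] == 42)]
print(subset["Credit amount"].sum())
```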

We can add multiple tabs to the same dashboard using the "+" icon next to the first tab.

Data exploration is an incredibly important part of the data science workflow for both understanding our data and planning future analysis. As we’ve seen, the tools within Watson Studio make the process straightforward and ensure accuracy.

Part Four: Build, Train, and Deploy Machine Learning/Deep Learning Models

Watson Studio Cloud offers multiple ways to build, train, and deploy models.

  1. Coding (within Python, Spark, Scala, or R environments in Jupyter notebooks, or RStudio).

  2. Using drag and drop methods in Modeler Flows.

  3. Using Automated Machine Learning with AutoAI.

Modeling With Coding Methods

Notebooks are a neat way to break code into interactive chunks that tell a story about our data science project. The Community Contribution in Watson Studio Cloud has some great resources to get started.

Let’s return to our German Credit Risk Model. After some data preprocessing, I built a scikit-learn Logistic Regression model to classify my population into “Risk” and “No Risk” categories. Once we’ve trained the model, we can save it in Watson Machine Learning (WML) and create a model deployment:

Creating a model deployment
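Before deploying, the classifier itself might look something like this, a minimal sketch on synthetic data; the features and labels here are assumptions, and the article's actual preprocessing is not shown:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic features standing in for the preprocessed credit data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                   # e.g. age, duration, amount, job
y = (X[:, 2] + 0.5 * X[:, 1] > 0).astype(int)   # 1 = "Risk", 0 = "No Risk"

model = LogisticRegression().fit(X, y)
print(model.score(X, y))     # training accuracy
print(model.predict(X[:1]))  # predicted class for one applicant
```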

Once we’ve added our credentials, we instantiate a WML object:

Instantiating a WML object

Then, we save the model in WML.

To create a deployment, we’ll need to fetch the model UID:

Fetching the model UID
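Taken together, the instantiate, save, fetch-UID, and deploy steps in the screenshots follow a pattern like the sketch below, based on the watson-machine-learning-client V3 Python API. The method names are written from memory, so treat this as a sketch and verify the calls against the client documentation before use:

```python
def deploy_credit_model(model, wml_credentials):
    """Sketch of save -> fetch UID -> deploy with the
    watson-machine-learning-client V3 Python API (names are assumptions)."""
    from watson_machine_learning_client import WatsonMachineLearningAPIClient

    client = WatsonMachineLearningAPIClient(wml_credentials)

    # Save the trained scikit-learn model in the WML repository.
    meta = {client.repository.ModelMetaNames.NAME: "German Credit Risk model"}
    stored = client.repository.store_model(model, meta_props=meta)

    # Fetch the model UID, create an online deployment, and return the
    # scoring endpoint URL for later use.
    model_uid = client.repository.get_model_uid(stored)
    deployment = client.deployments.create(model_uid, name="credit-risk-deployment")
    return client.deployments.get_scoring_url(deployment)
```

Running this requires live WML service credentials, so it is shown here only as a pattern.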

Now, we’re ready to use this deployment to score a model. First, we’ll get our scoring endpoint and then send our payload to this endpoint for scoring.

That is the full train, test, deploy, and score cycle in a notebook using Python. We could also replicate this using R.

Creating a scoring endpoint in Python
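The payload sent to the scoring endpoint is a simple JSON structure of field names and rows. The field names below are assumptions for illustration; in practice they must match the features the model was trained on:

```python
import json

# Shape of a WML scoring payload; the field names are assumptions and must
# match the model's training features.
payload = {
    "fields": ["Age", "Sex", "Job", "Credit amount", "Duration"],
    "values": [[42, "male", 2, 5951, 48]],
}
body = json.dumps(payload)
print(body)

# With a live deployment, this body is POSTed to the scoring endpoint, e.g.:
# requests.post(scoring_url, json=payload, headers={"Authorization": "Bearer <token>"})
```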

Modeling With Drag-and-Drop Methods

Drag-and-drop methods can be useful for both machine learning and deep learning models. To get a feel for the service, let's take a small detour from our credit risk scoring use case and look at a deep learning flow for an image classification model. We could build a neural network for credit scoring in the same way.

Creating a neural net for classification

At the right, we can set the parameters for each of the cells, in this case a Conv 2D layer.

Creating a Conv 2D layer
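The kernel size, stride, and padding parameters set in this panel determine the layer's output size via a standard formula; a quick sketch to check the arithmetic:

```python
# Output spatial size of a Conv 2D layer, which is what the kernel,
# stride, and padding parameters in the settings panel control.
def conv2d_output_size(size, kernel, stride=1, padding=0):
    return (size - kernel + 2 * padding) // stride + 1

# A 28x28 input (e.g. MNIST) through a 5x5 kernel, stride 1, no padding:
print(conv2d_output_size(28, 5))              # 24
# The same kernel with padding 2 preserves the input size:
print(conv2d_output_size(28, 5, padding=2))   # 28
```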

At the left, we see the palette where we can select the element we need on our canvas. After we’ve created the desired flow, we can import this flow as TensorFlow/Keras/PyTorch/Caffe code.

Importing the desired flow

We can also save this as a training definition for running experiments using Experiment Builder on Watson Studio. We can use Experiment Builder to test the deep learning models we build. We start by adding “Experiment” to the project.

Under + Add Training Definition, we can choose Add Existing Training Definition and pick the one we saved from the Modeler.

Adding existing training definition

Above, we see the output of an experiment run.

We can also use Experiment Builder for deep learning models built in Python using frameworks such as Keras or TensorFlow. 

Modeling With AutoAI

Building AutoAI experiments within Watson Studio is quite straightforward. We simply upload the dataset we want to run AutoAI on, choose the prediction column, and pick the metric we want to optimize.

Choosing the prediction column and the metric to optimize

Then we hit Run experiment and see the experiment pipeline:

Experiment pipeline

We can see the model leaderboard and examine all the models we’d like to consider. Let’s look at some results from our top model:

Results from model (one)
Results from model (two)

Once we’re satisfied with the model we have, we can save it for use in downstream production.

Part Five: Monitor, Analyze, and Manage

Now, it is finally time to incorporate Watson OpenScale within Watson Studio to make this a full data science masterpiece.

After we create model deployments, we can integrate them with Watson OpenScale for continuous model monitoring. Once we've created and configured our Watson OpenScale instance, we can choose the deployments we want to monitor; the Watson OpenScale documentation covers setup in more detail.

While setting up our configuration, we can define thresholds for how fair and accurate we want our model to be before we see alerts. Once that is done, our landing page will look like this:

Insights dashboard
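The fairness thresholds mentioned above compare favourable-outcome rates between a monitored group and a reference group. A simplified local version of that disparate-impact idea, with toy numbers rather than anything computed by OpenScale itself:

```python
# Simplified disparate impact ratio: the favourable-outcome rate for the
# monitored group divided by the rate for the reference group (toy numbers).
def disparate_impact(fav_monitored, n_monitored, fav_reference, n_reference):
    return (fav_monitored / n_monitored) / (fav_reference / n_reference)

# e.g. 60 of 100 female applicants vs 75 of 100 male applicants scored "No Risk":
di = disparate_impact(60, 100, 75, 100)
print(round(di, 2))  # values below a threshold such as 0.8 typically raise an alert
```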

To look at any particular model we can just click on its tile:

Further examining chosen model

These metrics give us insights into how accurate and fair our model is. At any point, we can decide to dig deeper into the analysis by clicking on Click to view details:

Credit Scoring model details

We can also see a list of all transactions as well as transactions that were biased.

Examining transactions with the credit scoring model

The Explain option offers model explainability by letting us see how a model reached its conclusions:

Explaining how the model reached predictions

We can then also choose to use a model de-biasing option within OpenScale. Look for more information about OpenScale in Bias Detection in IBM Watson OpenScale and De-Biasing in IBM Watson OpenScale.

So that’s it: we’re now at the end of our “End-to-End” journey. I hope this article helps you streamline your workflow in IBM Watson Studio and create your own data science masterpiece. And I hope building your own project is as exciting for you as building the demo and writing this article was for me.



Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
