Big Data Tutorial: Running an End-to-End Data Science Workflow in Watson Studio Cloud
From cleaning to classification: everything you need for an efficient data science workflow.
A data scientist’s day-to-day work is so much more than just building machine learning models with 99% accuracy. In many ways, great data science is like great art. We know that making great art requires more than inspiration, as artists devote time to exploring, learning, and mastering their tools. Data science is no different.
Much like the artistic process, a data scientist follows the data science workflow in an effort to create their own original and compelling work. Watson Studio Cloud (WSC) provides support for Data Scientists at every step of this workflow, making it a tool as valuable to the Data Scientist as paint is to the artist.
Almost every Data Scientist wants to follow the same process:
Let’s follow this workflow using the example of German Credit Risk Data and see how Watson Studio Cloud helps bring a data science project together from start to finish.
Part One: Connect and Access Data
After we create our account, we’ll see the resources we’ve provisioned on the IBM Cloud landing page. First, provision a Cloud Object Storage (COS) instance to associate with our account.
Next, go back to the IBM Watson Studio home page and choose to Create an empty project.
In this example, we’ll create a project to contain all our assets (data, notebooks, models, etc.). Be sure to name and describe the project and associate it with the COS instance we’ve provisioned.
After we create the project, we can add data to it. When we click on the + Add to Project button, we’ll see something like this:
We can either choose to add Data which lets us upload data directly from our device, or we can choose to add data from remote storage by selecting Connection. There are several available options for the task.
For this example, we’ll use a dataset we upload locally.
But first, let’s check out a Db2 Warehouse connection setup as another example. Be sure to have the access credentials for the data connection that you’re trying to add.
We can select Discover Data Assets to automatically add the datasets from the storage to our project of choice.
We can also add collaborators to our projects to integrate our work with that of our team members:
Part Two: Search and Find Relevant Data
Once we have our data assets in place, we can start cataloging the datasets that are important to us. For this, we’ll go back to the IBM Watson Studio home screen and select Create a Catalog.
There, we can choose to enforce certain data policies on our datasets. For this example, I’ve chosen to enforce all data policies. Later, we’ll see how these policies come into play.
Let’s check out my catalog by way of its landing page:
Note that all the assets in this catalog have tags, ratings, and associated reviews. As we build our catalog, these features become more and more useful, because we can quickly filter by highly rated assets or assets of a certain category and use them in projects. Under Access Control, we can add collaborators to the catalog.
Now that we’re inside this catalog, let’s explore some of the assets, starting with the structured dataset German Credit Data.csv.
The Overview tab gives a quick description of our dataset. The Access tab shows who has access to the data. The Review tab lets us rate and comment on the dataset; the Profile tab shows statistics for our dataset. The Profile tab looks something like this:
Finally, the Lineage tab shows the life cycle of a dataset:
Here, the first yellow "+" shows when we added the asset to the catalog. We can also see who added it and other metadata. The "pencil" markers describe asset updates, for example, when tags were created for the asset. The joined nodes tell when, where, and by whom the asset was added to a project.
Let’s now take a step back to look briefly at how these features function for unstructured data. I’ll open the file Pride and Prejudice.pdf. The Overview tab is very similar to the one for our credit data, but interesting things happen under the Profile tab. Watson Studio can run Watson Natural Language Understanding on the data and display particular information about it:
Similarly, in addition to Categories, we can see Concepts, Sentiment, or Emotion.
I mentioned that we’d return to the notion of enforcing policies. Those policies come into play whenever we’re adding a dataset to our catalog. We’ll have the option of classifying it as a type of dataset governed by a certain policy.
When we classify a dataset as belonging to any of these categories, we’ll be able to see its classification under the Overview tab.
Now that we know we can add datasets from projects to catalogs, as well as from catalogs to projects, let’s see what else we can do within a project.
Part Three: Prepare Data for Analysis
In this part of the data science workflow, Data Refinery becomes useful. Refinery is a self-service data preparation tool for data scientists and analysts. When we open a data asset within a project, at the top right corner we’ll see the option to Refine that data.
Selecting Refine takes us to the Data Refinery service, where we can perform operations by first choosing a column and selecting the transformation we want to apply from the drop-down menu.
Then, we can apply more customized transformations by entering R code into the box that says Code an operation to cleanse and shape your data.
Once we have added all the transformation steps into our refinery, we can save and run the flow. In this example, Watson Studio saves the flow name as German Credit Data.csv_flow and the output of the flow (our refined data) under assets as German Credit Data.csv_shaped.csv.
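Conceptually, a flow like this is an ordered list of transformation steps applied one after another to the input data. The sketch below is only a Python analogy for that idea; Data Refinery itself works with R operations, and the step functions, column names, and sample rows here are all hypothetical.

```python
from functools import partial


def drop_rows_missing(rows, column):
    """Remove rows where the given column is empty or absent."""
    return [r for r in rows if r.get(column) not in (None, "")]


def lowercase_column(rows, column):
    """Normalize a text column to lowercase, leaving other columns untouched."""
    return [{**r, column: r[column].lower()} for r in rows]


def run_flow(rows, steps):
    """Apply each step in order, feeding its output to the next step."""
    for step in steps:
        rows = step(rows)
    return rows


# Hypothetical raw rows (not taken from German Credit Data.csv).
raw = [
    {"Sex": "Male", "Purpose": "car"},
    {"Sex": "", "Purpose": "education"},
    {"Sex": "FEMALE", "Purpose": "furniture"},
]

# The "flow": an ordered list of steps, each bound to the column it acts on.
flow = [
    partial(drop_rows_missing, column="Sex"),
    partial(lowercase_column, column="Sex"),
]

shaped = run_flow(raw, flow)
```

Keeping the steps in a list mirrors how a saved flow can be rerun on the same source to regenerate the shaped output.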
Another important element of data exploration is creating a visual representation of the data. For this, we can create various Analytics Dashboards to quickly analyze our data. Within our project, we can choose to add a dashboard, give our new dashboard a name, and save it. Watson Studio then redirects us to a window where we can choose a template for our dashboard based on the number of graphs we might want to add.
Once we select what we want, we see the following window:
We can click "+" next to Selected sources to add desired datasets to our dashboard. This example uses German Credit Data.csv. We can then select the kind of charts we want to add to our dashboard.
Below, I add a bar chart for my first box. There, I can drag, for example, the Sex column in front of both the Bars and Length elements to get the count of males and females in my dataset.
This visualization then becomes my top left element in the dashboard below. Carrying on, I can add other visualizations and create the following credit data dashboard.
Now, we can start building a deeper analysis of our dataset. For example, we can see that 42-year-old men account for a high total loan amount. We can filter on that subset and look for insights in the other graphs, which update dynamically.
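The aggregations behind these charts are easy to reproduce in a notebook when we want to double-check what the dashboard shows. The following is a minimal sketch using hypothetical records: the LoanAmount column name and every value are assumptions made for illustration, not figures from the actual dataset.

```python
from collections import Counter, defaultdict

# Hypothetical records; column names and values are illustrative only.
records = [
    {"Sex": "male", "Age": 42, "LoanAmount": 9000},
    {"Sex": "male", "Age": 42, "LoanAmount": 6500},
    {"Sex": "female", "Age": 31, "LoanAmount": 3000},
    {"Sex": "male", "Age": 27, "LoanAmount": 1200},
    {"Sex": "female", "Age": 42, "LoanAmount": 4000},
]

# Bar chart equivalent: number of records for each value of Sex.
count_by_sex = Counter(r["Sex"] for r in records)

# Total loan amount for each (Sex, Age) group.
total_by_group = defaultdict(int)
for r in records:
    total_by_group[(r["Sex"], r["Age"])] += r["LoanAmount"]

# Filtering to one subset, as the dashboard does when we select a bar.
subset = [r for r in records if r["Sex"] == "male" and r["Age"] == 42]
```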
We can add multiple tabs to the same dashboard using the "+" icon next to the first tab.
Data exploration is an incredibly important part of the data science workflow for both understanding our data and planning future analysis. As we’ve seen, the tools within Watson Studio make the process straightforward.
Part Four: Build, Train and Deploy Machine Learning/Deep Learning Models
Watson Studio Cloud offers multiple ways to build, train, and deploy models.
Coding (within Python, Spark, Scala or R environments in Jupyter notebooks, or RStudio).
Using drag and drop methods in Modeler Flows.
Using Automated Machine Learning with AutoAI.
Modeling With Coding Methods
Notebooks are a neat way to break up our chunks of code and make it more interactive in order to tell a story about our data science project. The Community Contribution in Watson Studio Cloud has some great resources to get started.
Let’s return to our German Credit Risk Model. After some data preprocessing, I built a scikit-learn Logistic Regression model to classify my population into “Risk” and “No Risk” categories. Once we’ve trained the model, we can save it in Watson Machine Learning (WML) and create a model deployment:
Once we’ve added our credentials, we instantiate a WML object:
Then, we save the model in WML.
To create a deployment, we’ll need to fetch the model UID:
Now, we’re ready to use this deployment to score a model. First, we’ll get our scoring endpoint and then send our payload to this endpoint for scoring.
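A scoring payload is typically a small JSON document that names the input fields and lists one row of values per record. The sketch below builds such a payload, assuming the fields/values layout that Watson Machine Learning online scoring has accepted; the exact envelope depends on the client and API version, so treat it as an assumption to verify against the documentation. The helper function, feature names, and values are hypothetical.

```python
import json


def build_scoring_payload(records, fields):
    """Arrange records as a fields/values payload for a scoring request."""
    missing = {f for r in records for f in fields if f not in r}
    if missing:
        raise KeyError(f"records are missing fields: {sorted(missing)}")
    return {
        "fields": list(fields),
        # One inner list per record, with values in the same order as fields.
        "values": [[record[f] for f in fields] for record in records],
    }


# Hypothetical feature names and values, for illustration only.
fields = ["CheckingStatus", "LoanDuration", "LoanAmount", "Age"]
records = [
    {"CheckingStatus": "no_checking", "LoanDuration": 13, "LoanAmount": 1343, "Age": 46},
    {"CheckingStatus": "less_0", "LoanDuration": 24, "LoanAmount": 4567, "Age": 36},
]

payload = build_scoring_payload(records, fields)
body = json.dumps(payload)  # the request body sent to the scoring endpoint
```

Fixing the field order in one place keeps every row aligned with the column order the model was trained on.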
This completes the model training, testing, deployment, and scoring cycle in a notebook using Python. We could also replicate this using R.
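To make the training step of that cycle concrete, here is a dependency-free sketch of logistic regression fitted with batch gradient descent. It is a stand-in for illustration, not the scikit-learn model used in this article, and the two pre-scaled features, the sample data, and the hyperparameters are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, learning_rate=0.5, epochs=2000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            p = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
            error = p - label  # derivative of the log loss w.r.t. the logit
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict(features, weights, bias):
    """Label a record "Risk" when the predicted probability is at least 0.5."""
    p = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
    return "Risk" if p >= 0.5 else "No Risk"


# Hypothetical features scaled to the 0..1 range: [loan amount, loan duration].
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
y = [0, 0, 0, 1, 1, 1]  # 1 = "Risk", 0 = "No Risk"

weights, bias = train_logistic_regression(X, y)
labels = [predict(row, weights, bias) for row in X]
```

In practice a library implementation is preferable, since it adds regularization, better solvers, and input validation.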
Modeling With Drag-and-Drop Methods
Drag-and-drop methods can be useful for both machine learning and deep learning models. To get a feel for the service, let’s take a small detour from our credit risk scoring use case to look at a deep learning flow for an image classification model. We can build a neural network for credit scoring in the same way.
At the right, we can set the parameters for each of the cells — a Conv 2D layer, in this case.
At the left, we see the palette where we can select the element we need on our canvas. After we’ve created the desired flow, we can export this flow as TensorFlow/Keras/PyTorch/Caffe code.
We can also save this as a training definition for running experiments using Experiment Builder on Watson Studio. We can use Experiment Builder to test the deep learning models we build. We start by adding “Experiment” to the project.
Under + Add Training Definition, we can choose Add Existing Training Definition and pick the one we saved from the Modeler.
Above, we see the output of an experiment run.
We can also use Experiment Builder for deep learning models built in Python using frameworks such as Keras or TensorFlow.
Modeling With AutoAI
Building AutoAI experiments within Watson Studio is quite straightforward. We simply upload the dataset we want to run AutoAI on and choose the prediction column and the metric we want to optimize.
Then, hit Run experiment. We’ll then see the experiment pipeline:
We can see the model leaderboard and examine all the models we’d like to consider. Let’s look at some results from our top model:
Once we’re satisfied with the model we have, we can save it for use in downstream production.
Part Five: Monitor, Analyze, and Manage
Now, it is finally time to incorporate Watson OpenScale within Watson Studio to make this a full data science masterpiece.
After we create model deployments, we can integrate them with Watson OpenScale for continuous model monitoring. Once we’ve created and configured our Watson OpenScale instance, we can choose the deployments we want to monitor. Look for more information on setup in the documentation here.
While setting up our configurations, we can define thresholds for how fair and accurate we want our model to be before we see alerts. After we’ve set our configurations, our landing page will look like this:
To look at any particular model, we can just click on its tile:
These metrics give us insights into how accurate and fair our model is. At any point, we can decide to dig deeper into the analysis by clicking on Click to view details:
We can also see a list of all transactions as well as transactions that were biased.
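To give a sense of what a fairness threshold measures, the sketch below computes disparate impact, a common fairness measure: the rate of favorable outcomes in a monitored group divided by the rate in a reference group. It is an illustration under stated assumptions, not OpenScale’s implementation; the 0.8 threshold, the function names, and the sample outcomes are hypothetical.

```python
def favorable_rate(outcomes):
    """Fraction of outcomes that are favorable (True)."""
    return sum(outcomes) / len(outcomes)


def disparate_impact(monitored_outcomes, reference_outcomes):
    """Ratio of favorable-outcome rates: monitored group over reference group."""
    reference_rate = favorable_rate(reference_outcomes)
    if reference_rate == 0:
        raise ValueError("reference group has no favorable outcomes")
    return favorable_rate(monitored_outcomes) / reference_rate


def fairness_alert(ratio, threshold=0.8):
    """Flag the model when the ratio falls below the configured threshold."""
    return ratio < threshold


# Hypothetical predictions, where True means "No Risk" (the favorable outcome).
monitored = [True, False, True, False, False]  # 2 of 5 favorable
reference = [True, True, True, False, True]    # 4 of 5 favorable

ratio = disparate_impact(monitored, reference)
alert = fairness_alert(ratio)
```

A ratio near 1.0 means both groups receive favorable outcomes at similar rates; the further it falls below the threshold, the stronger the signal that the model should be examined.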
The Explain option offers model explainability by letting us see how a model reached its conclusions:
So that’s it: we’re now at the end of our “End-to-End” journey. I hope this article helps you streamline your workflow with IBM Watson Studio and create your own data science masterpiece. And I hope reading it (and implementing your own project) will be as exciting for you as building the demo and writing the article was for me.
Published at DZone with permission of Aakanksha Joshi, DZone MVB. See the original article here.