
Machine Learning in a Box (Part 7): Jupyter Notebook


Look at what Jupyter Notebook is, explore how to install it, and see how you can use the SAP HANA Python driver to connect to and consume your data.


A Quick Recap 

Last time, we looked at how to leverage the SAP HANA R integration, which opens the door to about 11,000 packages. So, if you feel like the built-in libraries (APL and PAL) don't offer what you need, or if you simply want to do things your own way, now you can!

I hope you all managed to try it out, and some of you have probably already started comparing the PAL implementations with R algorithms. Feel free to share your feedback!

I know that I promised to dive into the TensorFlow integration last time, but I ran into technical difficulties on my NUC and some of the virtual machines I run, which I had to rebuild a couple of times (yes, you can still mess up a system if you don't follow instructions). So I decided to postpone this piece until I can figure out an easy path to set up the TensorFlow Serving ModelServer on a SUSE or Red Hat system and provide an interesting but simple model example.

But this week, the topic will be Jupyter Notebook!

You may ask why Abdel is talking about Jupyter Notebook, as it is not an SAP product. The reason is simple: it's a great tool to use with SAP products, and its use has become very common these days for Machine Learning related activities.

This is probably the perfect tool to use with SAP HANA, express edition for all your Machine Learning activities, as it allows you to do almost everything you need in one place:

  • Run SQL queries and use the results in visualizations
  • Code in Python or R and leverage the thousands of available packages while consuming data from your SAP HANA instance
  • and so many other things

About Jupyter Notebook and Project Jupyter

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.

Its uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, Machine Learning, and much more.

Project Jupyter is a non-profit, open-source project, born out of the IPython Project in 2014 as it evolved to support interactive data science and scientific computing across all programming languages.

Jupyter is developed in the open on GitHub, through the consensus of the Jupyter community. For more information on its governance approach, see the project's Governance Document.

Installing Jupyter Notebook

You can install Jupyter Notebook either locally on your machine or on your SAP HANA, express edition server.

The installation is pretty straightforward if you choose a local installation, and you can refer to the online installation guide.
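
For reference, a minimal local installation (assuming Python and pip are already available on your machine) boils down to two commands:

pip install jupyter
jupyter notebook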

Now, if you decide to set up Jupyter on your SAP HANA, express edition server, you will need to consider a few things that I have documented in a tutorial.

For example, you will need to generate a configuration file and enable the use of the machine IP address instead of localhost in addition to setting up SSL (optional) and a password (optional).
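
As a rough sketch (the values below are illustrative, and the property names are those of the classic Notebook server), you would first run jupyter notebook --generate-config and then adjust a few entries in the generated jupyter_notebook_config.py:

# listen on all interfaces instead of just localhost
c.NotebookApp.ip = '0.0.0.0'
# optional: protect the server with a password hash
# (a hash can be generated with the passwd() helper from notebook.auth)
c.NotebookApp.password = 'sha1:...'
# optional: enable SSL with a certificate and private key
c.NotebookApp.certfile = '/path/to/mycert.pem'
c.NotebookApp.keyfile  = '/path/to/mykey.key'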

As usual, I have produced a tutorial that guides you through the installation, the configuration, and the connectivity to your SAP HANA, express edition instance. Here is the link:

Feel free to use the "provide feedback" link in the tutorial to let me know what you think about it.

SAP HANA, Express Edition + Jupyter + SQLAlchemy = Magic

You have probably realized that Jupyter was initially designed to run Python scripts (and is actually built from a series of Python modules).

This means that you can use the SAP HANA Python driver to connect and consume your data. But that would require you to write Python code using the Python Database API.
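
For illustration, here is a minimal sketch using the hdbcli driver, reusing the connection details from the examples later in this post:

from hdbcli import dbapi

# connect to the SAP HANA, express edition instance
connection = dbapi.connect(
    address='hxehost',
    port=39015,
    user='ML_USER',
    password='Welcome18Welcome18'
)

# execute a query through the Python Database API and fetch the results
cursor = connection.cursor()
cursor.execute("SELECT * FROM M_DATABASE")
for row in cursor.fetchall():
    print(row)

cursor.close()
connection.close()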

What if I told you that you could build a Jupyter notebook that runs SQL statements directly, without writing the Python Database API code yourself?

That would be like magic, right? In fact, it is ipython-sql magic. Thanks to Catherine Devlin's work, you can now prefix your SQL statements with a simple %sql and get the results!

Using the ipython-sql magic implies the use of SQLAlchemy, a great Python SQL toolkit and Object Relational Mapper module, along with the SQLAlchemy dialect for SAP HANA, which is part of the open-source SAP repository on GitHub.
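
If these modules are not already available in your Python environment, you can usually install them straight from a notebook cell (the PyPI package names assumed here are ipython-sql and sqlalchemy-hana):

!pip install ipython-sql sqlalchemy-hana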

To start playing with it, add and adjust the following code in the first cell of your notebook:

import sqlalchemy

# load the ipython-sql extension and limit the number of rows displayed
%reload_ext sql
%config SqlMagic.displaylimit = 5

# adjust the user, password, host, and port to match your instance
hxe_connection = 'hana://ML_USER:Welcome18Welcome18@hxehost:39015'

# open the connection
%sql $hxe_connection

Then you can prefix all your SQL statements with %sql, like this:

%sql select * FROM M_DATABASE;
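
For longer statements, ipython-sql also offers a %%sql cell magic that turns a whole cell into SQL:

%%sql
SELECT *
FROM M_DATABASE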

If you want, you can then manipulate the returned result set (see the ipython-sql API):

# the underscore holds the result of the last executed cell
result = _
print(result)
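
The result set object also exposes a DataFrame() method (assuming pandas is installed), so you can hand the results straight over to pandas for further processing:

# convert the ipython-sql result set into a pandas DataFrame
df = result.DataFrame()
print(df.head())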

SAP HANA, Express Edition + Jupyter + (R or Python) = Code, Test, Demo, and Share

When you enter the "Machine Learning" universe, you will probably think that your day-to-day job will be to run algorithms on your data.

But this is quite far from reality.

As you read in Machine Learning in a Box (week 2): Project Methodologies, a big portion of the effort is spent (and sometimes wasted) on data preparation.

So you will invest plenty of your time analyzing data and finding evidence that you will need to document. And I think (from experience building lots of documentation and PowerPoint decks) that you have to make it as interactive as possible.

If we look at the Iris flower dataset, even if it's "just" 4 numerical attributes to predict the species, you may use different visualizations to achieve that with your language of choice.

Let's say you want to plot the frequency of each species based on each of the 4 attributes. This is what you would code with the Python kernel in Jupyter, using pandas and pyplot from matplotlib:

import pandas as pd
import matplotlib.pyplot as plt 

from sqlalchemy import create_engine
engine = create_engine('hana://ML_USER:Welcome18Welcome18@localhost:39015')

# load the IRIS table into a pandas DataFrame
iris_db = pd.read_sql_query("select * from R_DATA.IRIS", engine)

# species names and the four feature column names
target_names  = iris_db.Species.unique()
feature_names = iris_db.drop('Species', axis=1).columns.tolist()

fig, axes = plt.subplots(figsize=(20,10), nrows=2, ncols=2)
fig.suptitle("Iris Frequency Histogram", fontsize=16)
colors= ['red', 'green', 'blue']

# one subplot per attribute, with one histogram per species
for i, ax in enumerate(axes.flat):
    for idx_label, color in zip(range(len(target_names)), colors):
        ax.hist(
            iris_db.loc[iris_db.Species == target_names[idx_label], feature_names[i]],
            label=target_names[idx_label],
            color=color
        )
    ax.set_xlabel(feature_names[i])
    ax.set_ylabel('Frequency')
axes[1,1].legend(loc='upper right')
plt.show()

But you can do exactly the same with an R Kernel:

library(ggplot2)
library(grid)
library(gridExtra)
library(RJDBC)

# connect to SAP HANA, express edition over JDBC
jdbcDriver <- JDBC(driverClass="com.sap.db.jdbc.Driver", "/usr/sap/hdbclient/ngdbc.jar")
jdbcConnection <- dbConnect(
    jdbcDriver,
    "jdbc:sap://localhost:39015/?autocommit=false",
    "ML_USER",
    "Welcome18Welcome18"
)
iris <- dbGetQuery(jdbcConnection, "select * from R_DATA.IRIS")

sl <- ggplot(data=iris, aes(x=Sepal.Length)) + 
    geom_histogram(binwidth=0.2, aes(fill=Species)) + 
    xlab("Sepal Length") + 
    theme(legend.position="none") + 
    ylab("Frequency")

sw <- ggplot(data=iris, aes(x=Sepal.Width )) + 
    geom_histogram(binwidth=0.2, aes(fill=Species)) + 
    xlab("Sepal Width" ) + 
    theme(legend.position="none") + 
    ylab("Frequency")

pl <- ggplot(data=iris, aes(x=Petal.Length)) +
    geom_histogram(binwidth=0.2, aes(fill=Species)) +
    xlab("Petal Length") +
    theme(legend.position="none") +
    ylab("Frequency")

pw <- ggplot(data=iris, aes(x=Petal.Width )) +
    geom_histogram(binwidth=0.2, aes(fill=Species)) +
    xlab("Petal Width") +
    theme(legend.position="none") +
    ylab("Frequency")

grid.arrange(sl, sw, pl, pw, ncol = 2, nrow = 2, top = textGrob("Iris Frequency Histogram"))

dbDisconnect(jdbcConnection)

And this will output a 2x2 grid of histograms similar to the one produced by the Python version above.

Or using scatter plots in Python:

import pandas as pd
import matplotlib.pyplot as plt 

from sqlalchemy import create_engine
engine = create_engine('hana://ML_USER:Welcome18Welcome18@localhost:39015')

iris_db = pd.read_sql_query("select * from R_DATA.IRIS",engine)

target_names  = iris_db.Species.unique()
feature_names = iris_db.drop('Species', axis=1).columns.tolist()
feature_len = len(feature_names)
target_len  = len(target_names)


# one row and one column per attribute, sharing axes across the grid
fig, axes = plt.subplots(
    figsize=(20,10),
    nrows=feature_len,
    ncols=feature_len,
    sharex='col',
    sharey='row'
)
fig.suptitle("Edgar Anderson's Iris Data", fontsize=16)

plot_colors = ['blue', 'white', 'red']

# scatter every attribute against every other one; diagonal cells show the attribute name
for x in range(feature_len):
    for y in range(feature_len):
        ax = axes[x,y]
        if x == y:
            ax.text(0.5, 0.5, 
                    feature_names[x], 
                    transform=ax.transAxes, 
                    horizontalalignment='center', 
                    verticalalignment='center', 
                    fontsize=16
                   )
        else:
            for idx_class, color in zip(range(target_len), plot_colors):
                ax.scatter(
                    iris_db.loc[iris_db.Species == target_names[idx_class], feature_names[x]], 
                    iris_db.loc[iris_db.Species == target_names[idx_class], feature_names[y]], 
                    c=color, 
                    cmap=plt.cm.RdYlBu, 
                    label=target_names[idx_class], 
                    edgecolor='black', 
                    s=40
                )
plt.show()

And in R:

library("RJDBC")
jdbcDriver <- JDBC(driverClass="com.sap.db.jdbc.Driver", "/usr/sap/hdbclient/ngdbc.jar")
jdbcConnection <- dbConnect(
    jdbcDriver,
    "jdbc:sap://localhost:39015/?autocommit=false",
    "ML_USER",
    "Welcome18Welcome18"
)
iris_db <- dbGetQuery(jdbcConnection, "select * from R_DATA.IRIS")

# replace the text value with a color
iris_db$SpeciesColor[iris_db$Species == "setosa"] <- "blue"
iris_db$SpeciesColor[iris_db$Species == "versicolor"] <- "white"
iris_db$SpeciesColor[iris_db$Species == "virginica"] <- "red"

pairs(
    iris_db[1:4], 
    main = "Edgar Anderson's Iris Data", 
    pch = 21, 
    bg = unclass(iris_db$SpeciesColor)
)
dbDisconnect(jdbcConnection)

What other options do I have to "demo and share" without too much coding?

Then I think SAP Lumira is what you are looking for. And the good news is that there is now even a dedicated openSAP course.

Conclusion

Adding Jupyter to your software stack will really help you achieve some of the key goals in data science: code, test, demo, and share.

Jupyter is definitely the tool I'll use and promote at the next hackathon I attend or host.
