Machine Learning in a Box (Week 6): SAP HANA R Integration
Let's take a look at a tutorial that explains how to enable the embedding of R code in the SAP HANA database context.
A Quick Recap
Last time, we looked at how to import data into SAP HANA, express edition, using the dataset provided by the SAP Predictive Analytics tools (and available online).
But the main idea was to show you how to import more or less any kind of text/CSV file into your HXE instance.
I hope you all managed to try this out, and some of you have probably already started playing with classification algorithms on the Census dataset or forecasting algorithms on the available time series data.
If you didn't start playing with algorithms, don't worry, the second part will deal with this. So, let's complete our setup with the SAP HANA R integration.
Next time, we will look at the External Machine Learning (EML) library.
The SAP HANA R integration is a bit different from the Machine Learning capabilities already available with the AFL libraries (APL and PAL).
With the R integration, you can literally execute R code in SQLScript. Okay, not directly like a simple SQL SELECT, but using RLANG.
Now you may ask, "What is RLANG?" Before answering that, let's first set some context by explaining what R is (for those who have never heard of it). Then, we will have a look at how the integration works, and finally, we will look at how to use it.
What Is R?
R is a programming language and free software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. R is a GNU package. The source code for the R software environment is written primarily in C, Fortran, and R. R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems. While R has a command line interface, there are several graphical front-ends available.
The capabilities of R are extended through user-created packages, which allow specialized statistical techniques, graphical devices, import/export capabilities, reporting tools (knitr, Sweave), etc. These packages are developed primarily in R, and sometimes in Java, C, C++, and Fortran. A core set of packages is included with the installation of R, but more than 11,000 additional packages (as of July 2017) are available at the Comprehensive R Archive Network (CRAN), GitHub, and other repositories.
The SAP HANA R Integration Scenarios
The goal of the integration of the SAP HANA database service with R is to enable the embedding of R code in the SAP HANA database context. That is, the SAP HANA database allows R code to be processed in-line as part of the overall query execution plan using RLANG.
This scenario is suitable when an SAP HANA-based modeling and consumption application wants to use the R environment for specific statistical functions, for example ones that are not provided by the built-in libraries. An efficient data exchange mechanism supports the transfer of intermediate database tables directly into the vector-oriented data structures of R.
This offers a performance advantage compared to standard SQL interfaces, which are tuple based and, therefore, require an additional data copy on the R side.
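To make the tuple-versus-vector distinction concrete, here is a minimal sketch in plain Python. It is not SAP HANA or R code: the sample rows, the column names, and the `rows_to_columns` helper are all made up for illustration. It shows the pivot (and therefore the extra copy) that a row-based interface needs before the data is available as one vector per column.

```python
# Hypothetical sample rows, in the shape a tuple-based SQL interface delivers them.
rows = [
    (5.1, 3.5, "setosa"),
    (7.0, 3.2, "versicolor"),
    (6.3, 3.3, "virginica"),
]
column_names = ["sepal_length", "sepal_width", "species"]


def rows_to_columns(names, rows):
    """Pivot row tuples into one list per column (this is the extra copy)."""
    columns = {name: [] for name in names}
    for row in rows:
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


columns = rows_to_columns(column_names, rows)
print(columns["sepal_length"])  # prints [5.1, 7.0, 6.3]
```

When the source is already stored column by column, as described above for SAP HANA intermediate tables, each column can be handed over as a vector without this pivot step.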
The SAP HANA R Integration Explained
To process R code in the context of the SAP HANA database, the R code is embedded in SAP HANA SQL code in the form of an RLANG procedure. The SAP HANA database uses the external R environment to execute this R code, similarly to native database operations like joins or aggregations.
This allows the application developer to elegantly embed R function definitions and calls within SQLScript and submit the entire code as part of a query to the database.
The diagram below depicts the overall integration:
When the calculation model plan execution reaches an R-operator, the calculation engine's R-client issues a request through the Rserve mechanism to create a dedicated R process on the R host. Then, the R-client efficiently transfers the R function code and its input tables to this R process and triggers R execution.
Once the R process completes the function execution, the resulting R data frame is returned to the calculation engine, which converts it back into its internal format. Since the internal column-oriented data structure used within the SAP HANA database for intermediate results is very similar to the vector-oriented R data frame, this conversion is very efficient.
A key benefit of having the overall control flow situated on the database side is that the database execution plans are inherently parallel and, therefore, multiple R processes can be triggered to run in parallel without having to worry about parallel execution within a single R process.
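The idea that the caller, rather than the R code, supplies the parallelism can be illustrated with a small Python analogy. This is only a sketch under stated assumptions: a thread pool stands in for the database execution plan (SAP HANA starts separate R processes, not threads), `score_partition` is a hypothetical stand-in for a single-threaded R function, and the partitions are made-up data.

```python
from concurrent.futures import ThreadPoolExecutor


def score_partition(values):
    """Stand-in for a single-threaded R function: the mean of one partition."""
    return sum(values) / len(values)


# Hypothetical partitions of an intermediate table.
partitions = [[1.0, 2.0, 3.0], [10.0, 20.0], [4.0, 4.0, 4.0, 4.0]]

# The caller (playing the role of the execution plan) fans out the work;
# score_partition itself contains no parallel logic.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(score_partition, partitions))

print(results)  # prints [2.0, 15.0, 4.0]
```

The function stays simple and sequential, and the orchestration layer decides how many copies of it run at once.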
Configure the SAP HANA R Integration With SAP HANA, Express Edition
The pre-built versions of R are not compiled with dynamic/shared libraries enabled, which is required for the SAP HANA integration. Therefore, you must compile R from its source code with dynamic/shared libraries enabled.
You can find all the details about that in the following tutorial:
At the end, you will also test the configuration by uploading one of the R built-in datasets (Iris).
Further details can also be found in the SAP HANA R Integration Guide.
As you may have noticed in the last step of the tutorial, you can access the R datasets and load them into SAP HANA:
-- Target table whose columns match R's built-in iris data frame
CREATE COLUMN TABLE IRIS (
    "Sepal.Length" DOUBLE,
    "Sepal.Width"  DOUBLE,
    "Petal.Length" DOUBLE,
    "Petal.Width"  DOUBLE,
    "Species"      VARCHAR(5000)
);

-- RLANG procedure: the body between BEGIN and END is R code
CREATE PROCEDURE LOAD_IRIS(OUT iris "IRIS")
LANGUAGE RLANG AS
BEGIN
    library(datasets)
    data(iris)
    iris <- cbind(iris)
END;

-- SQLScript wrapper that copies the R result into the IRIS table
CREATE PROCEDURE DISPLAY_IRIS()
AS BEGIN
    CALL LOAD_IRIS(iris);
    INSERT INTO IRIS SELECT * FROM :iris;
END;

CALL DISPLAY_IRIS();
SELECT * FROM IRIS;
This means that you can now import any of the sample datasets available in R. And guess what, R provides a "datasets" package with over a hundred datasets, as listed in the package documentation.
These datasets are really handy for learning, as each one comes with an R code example that you can try and compare with, for example, SAP HANA APL and PAL.
Now that you have the R integration set up, you can compare one of the PAL algorithms with R using the same dataset, such as Census.
With the SAP HANA R integration, we are almost done with the environment setup as we are just missing the EML library, which we will dive into next time.
This means that we will install a TensorFlow Serving server, connect our SAP HANA, express edition instance to it, and consume a simple model (which I need to find now).
Published at DZone with permission of Abdel Dadouche, DZone MVB. See the original article here.