Machine Learning in a Box (Week 5): Upload Machine Learning Datasets
Let's take a look at uploading machine learning datasets, explore SAP Predictive Analytics, and see how to load data with a GUI or a script.
In case you are jumping on a moving train, here is the link to the introduction article of the Machine Learning in a Box series, which lets you follow the series from the start.
A Quick Recap
Last time, we looked at which SAP HANA flavor you can pick (Server only), the hardware requirements, and some solutions if you don't have such a machine.
I hope you all managed to build up your instance and connect to it using your favorite SQL query tool.
Following Craig Cmehil's article, What's your setup? Care to share? #mydevsetup, feel free to share your setup this week (or later) to get opinions, recommendations, etc.
Now that you have a SAP HANA express edition instance up and running, you can start loading data.
I'm not going to ask you to load a petabyte of data (even though I once uploaded a flat file of about 50GB during a hackathon using only 3GB of RAM). Let's be realistic and keep these challenges for when we have gained more skills around HANA.
The data that you will upload is part of the SAP Predictive Analytics sample datasets.
I have used these datasets for the last 8 years, not only to demonstrate how the product works, but also to explain how the algorithms work, the value of automation, etc.
Let me first properly introduce SAP Predictive Analytics, then we will have a look at the sample datasets.
SAP Predictive Analytics
SAP Predictive Analytics was born in 2014, if I remember correctly, about a year after the acquisition of KXEN by SAP.
SAP had built a tool called SAP Predictive Analysis to address the need for a data scientist persona.
At that time, SAP Predictive Analysis was already able to consume data from SAP HANA, leveraging the SAP HANA Predictive Analysis Library (PAL) and the SAP HANA R integration. It could also consume data from more or less any database with a JDBC driver and still leverage about 20 built-in algorithms in addition to a local R integration.
On the other side, KXEN brought InfiniteInsight and a series of automated algorithms, but also automated data preparation, the ability to export the scoring equation to almost 40 different programming languages or databases, and a module dedicated to deployment and monitoring (Factory).
The so-called KXEN algorithms are now SAP intellectual property, so you won't find fine details on their implementation. What you can find is that they follow the Structural Risk Minimization principle by Vladimir Vapnik and Alexey Chervonenkis.
For those who don't know these 2 guys (Vladimir Vapnik and Alexey Chervonenkis), they invented the original Support Vector Machine algorithm in 1963.
The intent with SAP Predictive Analytics was to merge the Automated Analytics (formerly the KXEN components) and the Expert Analytics (the SAP Predictive Analysis) into one product.
One of the first tasks right after the KXEN acquisition was to bring the automated algorithms into:
- SAP HANA, which led to the SAP HANA Automated Predictive Library (APL)
- the Expert Analytics side, with an additional node for the offline and online modes
- every SAP application and solution (Hybris, SFSF, C4C, ...)
There was also a multitude of initiatives where the automated analytics were embedded in a way that is completely invisible to the end user, like in Lumira or the Digital Boardroom.
Upload Data in SAP HANA, Express Edition
As a data practitioner, you already know that there is no magic when you have to deal with uploading data. You either use a tool with a GUI and configure it, or you build a script.
Using a GUI
The GUI option is fine if you don't have many files to upload or if you will only do it once or twice. For that, you can use the Import feature of the SAP HANA Tools for Eclipse.
I wrote the following tutorial to introduce how it works: Import CSV into SAP HANA, express edition using the SAP HANA Tools for Eclipse.
The Import wizard from the SAP HANA Tools for Eclipse only lets you upload data that is local to the machine where Eclipse is running. It can also create the table if it doesn't exist.
The scripting option relies on the IMPORT FROM SQL command.
I wrote the following tutorial to introduce how it works: Import CSV into SAP HANA, express edition using IMPORT FROM SQL command.
The IMPORT FROM SQL command requires the data to be located in a specific location on the SAP HANA host (this can be reconfigured if needed), and the recipient table must exist before running the command. It supports a multitude of options like date or time format, field delimiter, etc.
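To give an idea of what such a script can look like, here is a minimal Python sketch that only builds the SQL text: one statement to create the recipient table, and one IMPORT FROM statement with a few common options. The schema name (ML_DATA), the table name (CENSUS), the column list, and the file path are all hypothetical placeholders, and the option set shown is a small subset. You would still need to run the generated statements with your favorite SQL query tool, and check them against the documentation for your SAP HANA version.

```python
def quote_identifier(name):
    """Wrap an identifier in double quotes, doubling any embedded quote."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value):
    """Wrap a string literal in single quotes, doubling any embedded quote."""
    return "'" + value.replace("'", "''") + "'"


def build_create_table(schema, table, columns):
    """Return a CREATE COLUMN TABLE statement for (name, sql_type) pairs."""
    column_defs = ", ".join(
        f"{quote_identifier(name)} {sql_type}" for name, sql_type in columns
    )
    return (
        f"CREATE COLUMN TABLE {quote_identifier(schema)}."
        f"{quote_identifier(table)} ({column_defs})"
    )


def build_import_statement(file_path, schema, table, field_delimiter=",",
                           skip_first_rows=1, date_format=None):
    """Return an IMPORT FROM CSV FILE statement with a few common options."""
    options = [f"FIELD DELIMITED BY {quote_literal(field_delimiter)}"]
    if skip_first_rows:
        # Skip header rows so column names are not loaded as data.
        options.append(f"SKIP FIRST {int(skip_first_rows)} ROW")
    if date_format:
        options.append(f"DATE FORMAT {quote_literal(date_format)}")
    option_text = " ".join(options)
    return (
        f"IMPORT FROM CSV FILE {quote_literal(file_path)} "
        f"INTO {quote_identifier(schema)}.{quote_identifier(table)} "
        f"WITH {option_text}"
    )


# Hypothetical schema, table, columns, and server-side file path.
ddl = build_create_table(
    "ML_DATA", "CENSUS", [("age", "INTEGER"), ("class", "INTEGER")]
)
load = build_import_statement(
    "/path/on/hana/host/Census.csv", "ML_DATA", "CENSUS",
    date_format="YYYY-MM-DD",
)
print(ddl)
print(load)
```

Generating the statements from a function like this makes it easy to loop over many files, which is where scripting starts to pay off compared to a wizard.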
My preference goes to scripting because, I have to admit, I'm a lazy guy, and if I can avoid some clicking, I will.
In addition, this option performs much better, especially when you start uploading larger files.
SAP Predictive Analytics Sample Dataset
SAP Predictive Analytics provides a series of sample datasets to help you get started with the tool itself.
And with version 3.3, they were all made available as part of the online documentation: https://help.sap.com/pa
On the bottom right-hand side, you will see the Samples section.
You can then click on "View All" to access the full list of sample datasets.
I have prepared another tutorial to help you with this: Import SAP Predictive Analytics Datasets.
It explains how to import the following datasets:
- Association Rules Dataset
- Census Dataset
- Geolocalization Dataset
- Time Series Dataset
- Text Coding
- Social/Link Analysis
The Census dataset can be used for classification using the class variable as the target, for clustering, or for regression using one of the continuous variables like age.
The Geolocalization dataset can be used for classification in conjunction with SAP HANA spatial capabilities or with the association and social algorithms.
Now you should have your HXE tenant ready with data loaded to run algorithms. Next week, we will continue with the environment preparation and look at the open source R integration.
For those who want to start with some algorithms right away, I recommend you use the Census dataset and one of the PAL algorithms, but you will have to share your experiments!
Published at DZone with permission of Abdel Dadouche, DZone MVB. See the original article here.