In-Database Analytics with R: Part I
This is the first post in our R blog series.
Why is R So Popular?
Over the past few years, R, the open source scripting language for statistical data processing, has gained a lot of attention in the market. Even established vendors in that area feel the headwind R is causing. Through the CRAN archive, R can be extended with the latest analytical algorithms simply by downloading so-called “packages”. More than 5,000 packages are listed on the CRAN server today. Not all of them share the same level of quality, but why not give it a try - it’s free!
R packages cover a wide range of areas, from classical statistics, life sciences, signal processing, and machine learning to a live link to OpenStreetMap, to name just a few. R also ships with powerful visualization capabilities.
Consequently, many students learn R during their time at university and want to stick with what they know best when they start a professional career in data analytics. R can be seen as a “one stop shop” for statistical data analytics - but with problems when it comes to data processing at scale!
Limitations of R
The major drawback is that R hits memory limits very quickly as data sets grow, because it holds the data it works on in main memory. Furthermore, R code executes in a single thread by default, so processing large data sets in parallel is not what R does best.
Packages are available that let R execute calculations in parallel. However, they put a considerable burden on the person writing and maintaining the R code - typically someone who should be presenting results rather than developing technically complex code.
Combination of R with MPP Systems
To overcome some of the limitations mentioned above, embedding R code execution into Massively Parallel Processing engines (MPP engines) is a straightforward thing to do. Analytical relational databases have followed the MPP principle for decades, and combining the two technologies is a key to success in many areas, such as:
- Data Storage and Access – these areas are the database’s home ground.
- Massively Parallel Processing – implemented in database products like ParStream.
- Integration with existing tools – R execution is triggered by SQL commands. Hence, many front-end tools will be able to trigger analytical algorithms implemented in R and show or visualize the results with the capabilities they already have.
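The last point, analytics triggered by an ordinary SQL command, can be illustrated with a small sketch. It is not ParStream's or R's actual interface: Python's built-in `sqlite3` module stands in for the analytical database, a Python function stands in for the R code, and the table, the function name `score_churn`, and its coefficients are all invented for illustration.

```python
import math
import sqlite3


def score_churn(tenure_months, support_calls):
    """Hypothetical logistic scoring function; the coefficients are made up."""
    z = -1.0 - 0.05 * tenure_months + 0.6 * support_calls
    return 1.0 / (1.0 + math.exp(-z))


# In-memory stand-in for the analytical database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, tenure_months REAL, support_calls REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 48, 0), (2, 3, 5), (3, 12, 2)],
)

# Register the function under a SQL name so that any SQL client can call it.
conn.create_function("score_churn", 2, score_churn)

# The analytical code is now triggered by a plain SQL statement.
rows = conn.execute(
    "SELECT id, score_churn(tenure_months, support_calls) AS churn_score "
    "FROM customers ORDER BY churn_score DESC"
).fetchall()

for customer_id, churn_score in rows:
    print(customer_id, round(churn_score, 3))
```

Because the scoring logic sits behind a SQL function name, a front-end tool only needs to issue the `SELECT`; it does not need to know in which language the function was written.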
In Which Phases of the Analytical Process is R Most Beneficial?
Many data analysts and data scientists follow the steps defined in the “Cross Industry Standard Process for Data Mining” (CRISP). It spans steps from business understanding to the deployment of an analytical solution in a production environment. A key observation about CRISP is iteration: steps are repeated again and again to obtain better results, or an idea is dropped altogether because its results are not promising. It is all about “Fail Forward” or “Fail Fast” in order to gain a competitive advantage with data analytics.
This has two major implications when dealing with large data sets. Firstly, the technology stack should consist of an integrated set of tools. Secondly, the analysts using this tool stack should be familiar with all tools in order to select the right tool for the task ahead.
Looking at the CRISP process in more detail, R adds the most value in the step where predictive models are built and in model deployment (a.k.a. “scoring”). Understanding new data sets and preparing a data set for estimating the parameters of a model can be done in R, but analytical databases have been built with exactly this use case in mind: ingest large amounts of data and allow fast access to it, for example for data profiling (counts of distinct values, sums, null values, etc.).
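As a rough illustration of the data profiling mentioned above, the sketch below computes row counts, distinct values, null counts, and a sum in a single SQL pass. The `orders` table and its contents are invented, and SQLite (through Python's `sqlite3` module) is only a small stand-in for an analytical database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "DE", 10.0),
        (2, "DE", None),
        (3, "US", 25.5),
        (4, None, 4.5),
        (5, "FR", 10.0),
    ],
)

# One set-oriented query profiles the whole table; no rows leave the database.
profile_sql = """
SELECT
    COUNT(*)                                         AS row_count,
    COUNT(DISTINCT country)                          AS distinct_countries,
    SUM(CASE WHEN country IS NULL THEN 1 ELSE 0 END) AS null_countries,
    SUM(CASE WHEN amount  IS NULL THEN 1 ELSE 0 END) AS null_amounts,
    SUM(amount)                                      AS total_amount
FROM orders
"""
cursor = conn.execute(profile_sql)
column_names = [col[0] for col in cursor.description]
profile = dict(zip(column_names, cursor.fetchone()))
print(profile)
```

Note that `COUNT(DISTINCT country)` ignores NULLs, which is why the missing values are counted separately.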
Preparing data sets to be used as input to an analytical process, which still consumes a lot of the analyst’s time, typically fits the SQL set-processing approach well. Data preparation also means using filters, again something analytical databases can do at vast scale with ease. Creating multiple data sets, e.g., by defining multiple views on the same base table, nicely supports the evolution of a predictive model (k-fold cross validation etc.). All of this can be done at scale without moving large amounts of data between different systems, which makes handling large data sets a straightforward task.
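One way to picture “multiple views on the same base table” for k-fold cross validation is sketched below. It assumes a table `observations` with an integer `id` and assigns rows to folds by `id % K`, which is deterministic rather than random; the table, the view names, and the choice of `K = 3` are illustrative only.

```python
import sqlite3

K = 3  # number of folds; an arbitrary choice for this sketch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (id INTEGER PRIMARY KEY, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [(i, float(i), 2.0 * i + 1.0) for i in range(1, 10)],
)

# One pair of views per fold. K and fold are trusted integers, so formatting
# them into the DDL is safe here; views cannot take bound parameters.
for fold in range(K):
    conn.execute(
        f"CREATE VIEW train_fold_{fold} AS "
        f"SELECT * FROM observations WHERE id % {K} <> {fold}"
    )
    conn.execute(
        f"CREATE VIEW test_fold_{fold} AS "
        f"SELECT * FROM observations WHERE id % {K} = {fold}"
    )


def count_rows(view_name):
    """Return the number of rows visible through a view."""
    return conn.execute(f"SELECT COUNT(*) FROM {view_name}").fetchone()[0]


# (training rows, test rows) for each fold
fold_sizes = [
    (count_rows(f"train_fold_{fold}"), count_rows(f"test_fold_{fold}"))
    for fold in range(K)
]
print(fold_sizes)
```

The views store no data of their own, so every fold is defined on the same base table without copying or exporting a single row.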
Let’s think this through a little further. Rather than moving the data to the analytics (extracting data into files to be picked up in R later), the obvious step is to move the analytics to the data (creating a user-defined function in the database using R). Being able to link SQL execution with R processing running in parallel on the same system is a huge benefit to the analytical team.
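To make “moving the analytics to the data” a bit more concrete, here is a minimal sketch of a user-defined aggregate that fits a least-squares slope per group inside the database. It uses Python and SQLite rather than R and an MPP engine, so it runs in a single process; in an MPP database, each node would typically evaluate such a function on its own share of the data. The names `SlopeFit`, `ols_slope`, and the `readings` table are hypothetical.

```python
import sqlite3


class SlopeFit:
    """User-defined aggregate: least-squares slope of y on x for one group."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xx = 0.0
        self.sum_xy = 0.0

    def step(self, x, y):
        # Called once per row; rows with missing values are skipped.
        if x is None or y is None:
            return
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xx += x * x
        self.sum_xy += x * y

    def finalize(self):
        # Called once per group; returns NULL if the slope is undefined.
        denom = self.n * self.sum_xx - self.sum_x ** 2
        if self.n < 2 or denom == 0:
            return None
        return (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("a", 1, 3), ("a", 2, 5), ("a", 3, 7),   # y = 2x + 1
        ("b", 0, 4), ("b", 2, 3), ("b", 4, 2),   # y = -0.5x + 4
    ],
)
conn.create_aggregate("ols_slope", 2, SlopeFit)

# SQL selects and groups the data; the analytical code runs where the data is.
slopes = conn.execute(
    "SELECT sensor, ols_slope(x, y) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
print(slopes)
```

No file export is involved: the `GROUP BY` decides which rows each model sees, and only one fitted value per group comes back to the analyst.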
Removing the data handling burden from the data science team and leveraging the analytical database’s parallel processing capabilities allows each data scientist to do more iterations through the analytical process per day. This translates into more analytical ideas evaluated in the same time. It also means that analysts can focus on improving or better validating their models, as defining the analytical data set to be processed becomes a simple task with the power of SQL. In addition, storing the analytical results in the same system makes them easily accessible to the tools already used in the business processes today.
Published at DZone with permission of Michael Hummel, DZone MVB. See the original article here.