
RSparkling: The Best of R + H2O + Spark


When you combine R, H2O, and Spark together, you get the best of data science, machine learning, and data processing — all in one.


R is great for statistical computing, graphics, and small-scale data preparation. H2O is a distributed machine learning platform, written in Java with R and Python interfaces, designed for scale and speed. Spark is great for super-fast data processing at mega scale. Combine all three and you get the best of data science, machine learning, and data processing in one stack.

  • rsparkling: The rsparkling R package is an extension package for sparklyr that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface for H2O’s high-performance distributed machine learning algorithms on Spark using R.

  • SparkR is an R package that provides a lightweight frontend to use Apache Spark from R. In Spark 2.2.0, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation, etc. — similar to R data frames like dplyr, but on large datasets. SparkR also supports distributed machine learning using MLlib.

  • H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON, and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark.

  • Apache Spark is a fast and general engine for big data processing with built-in modules for streaming, SQL, machine learning, and graph processing.

  • Sparkling Water integrates H2O’s fast, scalable machine learning engine with Spark. With Sparkling Water, you can publish Spark data structures (i.e., RDDs, DataFrames, and Datasets) as H2O frames and vice versa. You can use the Sparkling Water DSL to feed Spark data structures into H2O’s algorithms, create ML applications with the Spark and H2O APIs, and use Sparkling Water directly from PySpark via the Python interface.
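To make the Spark-to-H2O round trip concrete, here is a minimal sketch (assuming a working rsparkling/sparklyr installation and a local Spark distribution) that publishes a Spark DataFrame as an H2O frame and converts it back:

```r
library(rsparkling)
library(sparklyr)
library(h2o)

# Connect to a local Spark cluster and start the H2O context on it
sc <- spark_connect(master = "local")
h2o_context(sc)

# Publish a Spark DataFrame as an H2O frame...
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")
mtcars_hf  <- as_h2o_frame(sc, mtcars_tbl)

# ...and convert an H2O frame back into a Spark DataFrame
mtcars_sdf <- as_spark_dataframe(sc, mtcars_hf)

spark_disconnect(sc)
```

The conversions copy data between the Spark and H2O in-memory representations, so each engine can apply its own algorithms to the same dataset.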

Installation Packages

Quick Start Script

Sys.setenv(SPARK_HOME='/Users/avkashchauhan/tools/spark-2.1.0-bin-hadoop2.7')
options(rsparkling.sparklingwater.version = "2.1.14") 
options(rsparkling.sparklingwater.location = "/Users/avkashchauhan/tools/sw2/sparkling-water-2.1.14/assembly/build/libs/sparkling-water-assembly_2.11-2.1.14-all.jar")
library(rsparkling)
library(sparklyr)
sc = spark_connect(master = "local", version = "2.1.0")
sc
h2o_context(sc, strict_version_check = FALSE)
library(h2o)
h2o.clusterInfo()
h2o_flow(sc)
spark_disconnect(sc)

Important settings for your environment include:

  • master = "local" starts a local Spark cluster
  • master = "yarn-client" starts a session on a cluster managed by YARN
  • To get a list of supported Sparkling Water versions, call h2o_release_table()
  • When you call spark_connect(), you will see that new tabs appear in RStudio:
    • The “Spark” tab launches the Spark UI
    • The “Log” tab collects the Spark logs
  • If you hit a version mismatch between sparklyr and Spark, pass the exact version as shown above; otherwise, the version argument can be omitted.
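For example, a YARN-managed session differs from the local setup only in the master argument. A sketch, assuming Spark is configured to talk to your YARN cluster:

```r
library(rsparkling)
library(sparklyr)

# Connect to a cluster managed by YARN instead of a local master
sc <- spark_connect(master = "yarn-client", version = "2.1.0")

# List the Sparkling Water releases that rsparkling knows about
h2o_release_table()

spark_disconnect(sc)
```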

Start the script with config parameters to set executor settings. These are the settings you will use to get your rsparkling/Spark session up and running in RStudio:

Sys.setenv(SPARK_HOME='/Users/avkashchauhan/tools/spark-2.1.0-bin-hadoop2.7')
options(rsparkling.sparklingwater.version = "2.1.14") 
options(rsparkling.sparklingwater.location = "/Users/avkashchauhan/tools/sw2/sparkling-water-2.1.14/assembly/build/libs/sparkling-water-assembly_2.11-2.1.14-all.jar")
library(rsparkling)
library(sparklyr)
config <- spark_config()
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
config$spark.executor.instances <- 3  # creates 3 executor instances
sc <- spark_connect(master = "local", config = config, version = '2.1.0')
sc
h2o_context(sc, strict_version_check = FALSE)
library(h2o)
h2o.clusterInfo()
spark_disconnect(sc)

You can access Spark UI just by clicking the SparkUI button from the Spark tab as shown below:

(Screenshot: the SparkUI button on the Spark tab in RStudio)

To access the H2O Flow UI, run this command to open it in your browser:

h2o_flow()

(Screenshot: the H2O Flow notebook interface)

Building H2O GLM Model With rsparkling + sparklyr + H2O

In this example, we ingest the famous mtcars (“cars and MPG”) dataset and build a GLM (generalized linear model) to predict miles per gallon from a car’s specifications:

options(rsparkling.sparklingwater.location = "/tmp/sparkling-water-assembly_2.11-2.1.7-all.jar")
library(rsparkling)
library(sparklyr)
library(h2o)
sc <- spark_connect(master = "local", version = "2.1.0")
iris_tbl <- copy_to(sc, iris, "iris")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)
mtcars_h2o <- as_h2o_frame(sc, mtcars_tbl, strict_version_check = FALSE)
mtcars_glm <- h2o.glm(x = c("wt", "cyl"),
                      y = "mpg",
                      training_frame = mtcars_h2o,
                      lambda_search = TRUE)
mtcars_glm
spark_disconnect(sc)
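Once the GLM is trained, you would typically score data with it and inspect the fit. A minimal follow-on sketch, assuming the session from the listing above is still open (in practice you would score a held-out frame rather than the training data):

```r
# Score an H2O frame with the fitted GLM
predictions <- h2o.predict(mtcars_glm, newdata = mtcars_h2o)
head(predictions)

# Inspect model quality via the standard H2O accessors
h2o.r2(mtcars_glm)    # R^2 on the training data
h2o.coef(mtcars_glm)  # fitted coefficients
```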

That’s all; enjoy!



Published at DZone with permission of Avkash Chauhan, DZone MVB. See the original article here.

