
JVM Advent Calendar: Apache Zeppelin: Stairway to *Notes* Heaven!

Note stands for notebooks in Apache Zeppelin.

Introduction

Continuing from the previous post, Two Years in the Life of AI, ML, DL, and Java, where I explained my motivation and mentioned our discussions: one of those discussions was that you can write in languages like Python, R, and Julia in JuPyteR notebooks. Most people were not aware that, with the help of Apache Zeppelin notebooks, you can also write Java and Scala in addition to Python, SQL, etc. So, I wanted to share something to broaden everyone’s awareness of Apache Zeppelin and its features. The project itself is written in Java and has an open architecture, which means that Zeppelin can support any language as long as an interpreter for it has been provided.

First Things First

In case I have lost some of you, here’s what I meant by JuPyteR notebooks and writing notebooks in different languages. Also, have a look at the list of kernels supported by JuPyteR notebook. In this post, however, we are covering Apache Zeppelin, how to get it to work, and how to use a couple of notes in the Zeppelin environment.

So let’s have a look at how we do it by first downloading and installing Apache Zeppelin.

Download and Installation

Download

Go to the Download page, where a number of options are available. Two of the recommended options are:

  • Download the entire binary containing the interpreters
  • Download a net installer and then download the interpreters (you can choose the ones you need or use the --all flag for all the interpreters)

In our case, I downloaded the net-install interpreter package from the download binary package section.

Installation

I unpacked the .tgz archive and placed it in the /opt/ folder and ran:

$ cd /opt/zeppelin-0.8.0-bin-netinst
$ ./bin/install-interpreter.sh --all
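If only a few interpreters are needed, the net installer can be pointed at a subset instead of --all. The following is a minimal sketch, assuming the --name option of install-interpreter.sh (which takes a comma-separated list) and an illustrative choice of interpreters (md, shell, python). The join_interpreters helper is hypothetical, and the script only prints the command when Zeppelin is not found at the assumed path:

```shell
#!/bin/sh
# Hypothetical helper: join its arguments with commas, e.g. "md,shell,python".
# The body runs in a subshell so the IFS change does not leak out.
join_interpreters() (
  IFS=,
  printf '%s\n' "$*"
)

# Assumed install location, taken from the unpacking step in this post.
ZEPPELIN_HOME="${ZEPPELIN_HOME:-/opt/zeppelin-0.8.0-bin-netinst}"
NAMES="$(join_interpreters md shell python)"

if [ -x "$ZEPPELIN_HOME/bin/install-interpreter.sh" ]; then
  # Install only the selected interpreters.
  "$ZEPPELIN_HOME/bin/install-interpreter.sh" --name "$NAMES"
else
  # Dry run: show what would be executed on a machine with Zeppelin unpacked.
  echo "would run: $ZEPPELIN_HOME/bin/install-interpreter.sh --name $NAMES"
fi
```

The same script also accepts --list, which should show the interpreter names that can be passed to --name.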


For another type of archive or installation option, see the instructions on the Quick Start page.

Running

Depending on the type of binary downloaded, follow the instructions on the Quick Start page.

In our case, though, I just had to run:

$ cd /opt/zeppelin-0.8.0-bin-netinst
$ ./bin/zeppelin.sh
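Zeppelin takes a little while to come up, so it can help to wait until the web UI answers before opening the browser. Here is a rough sketch, assuming curl is installed and that Zeppelin listens on its default address http://localhost:8080. The wait_for_url helper is hypothetical and not part of Zeppelin:

```shell
#!/bin/sh
# wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until it answers successfully or the attempts run out.
wait_for_url() {
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fsS -o /dev/null --max-time 5 "$url" 2>/dev/null; then
      echo "up after $i attempt(s): $url"
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "gave up after $attempts attempt(s): $url" >&2
  return 1
}

# Usage (after starting Zeppelin in another terminal):
#   wait_for_url http://localhost:8080 30 2
```

If Zeppelin should run in the background instead of occupying a terminal, the distribution also ships bin/zeppelin-daemon.sh (start and stop), which pairs naturally with a wait like this.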


Optional Setting

I was curious to see how Zeppelin would behave when run under a JDK or JRE other than the usual Oracle JDK or OpenJDK, so I decided to try the GraalVM JRE. I switched JAVA_HOME to point to /path/to/GraalVM/jre on my machine. The GraalVM JDK comes bundled with a JRE, which can be used independently, just like any other Java vendor’s JRE.

When Zeppelin is run, these messages are shown (you can see the JAVA_HOME settings have been picked up):

Pid dir doesn't exist, create /opt/zeppelin-0.8.0-bin-netinst/run
GraalVM 1.0.0-rc7 warning: ignoring option MaxPermSize=512m; support was removed in 8.0
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/zeppelin-0.8.0-bin-netinst/lib/interpreter/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/zeppelin-0.8.0-bin-netinst/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Dec 25, 2018 1:34:23 AM org.glassfish.jersey.internal.inject.Providers checkProviderRuntime
WARNING: A provider org.apache.zeppelin.rest.NotebookRepoRestApi registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider org.apache.zeppelin.rest.NotebookRepoRestApi will be ignored.
Dec 25, 2018 1:34:23 AM org.glassfish.jersey.internal.inject.Providers checkProviderRuntime
Dec 25, 2018 1:34:23 AM org.glassfish.jersey.internal.inject.Providers checkProviderRuntime
[---- snipped ----]
WARNING: The (sub)resource method getNoteList in org.apache.zeppelin.rest.NotebookRestApi contains empty path annotation.
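Before starting Zeppelin, it can be worth checking which java binary a JAVA_HOME-based launcher is likely to pick up. The sketch below follows the common convention of preferring $JAVA_HOME/bin/java and falling back to whatever java is on the PATH. That Zeppelin's launcher resolves Java in the same way is an assumption here, and resolve_java is a hypothetical helper:

```shell
#!/bin/sh
# resolve_java: print the java binary that would be used, preferring JAVA_HOME.
resolve_java() {
  if [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]; then
    printf '%s\n' "$JAVA_HOME/bin/java"
  else
    # Fall back to the first java found on the PATH (fails if there is none).
    command -v java
  fi
}

# Report the result without failing when no Java is installed at all.
echo "java resolved to: $(resolve_java || echo 'none found')"
```

Running the resolved binary with -version (for example, "$(resolve_java)" -version) should then print the GraalVM banner if the switch has taken effect.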


Running (Continued)

Once all the above steps are completed and Zeppelin has successfully started, complete the steps below:

Small Experiment

Just to look at some numbers, I decided to use the Zeppelin Tutorial/Basic Features (Spark) notebook to check the difference in performance when run using GraalVM JDK/JRE and another JDK/JRE. Here are the results:

GraalVM JDK

  • ./bin/zeppelin.sh 48.26s user 25.63s system 28% cpu 4:20.15 total (started and stopped the script manually)
  • First paragraph:
      • Took 47 sec. Last updated by anonymous at December 25 2018, 2:18:36 AM.
  • Each paragraph thereafter (columns from left to right):
      • Took 44 sec. Last updated by anonymous at December 25 2018, 2:18:44 AM. (outdated)
      • Took 10 sec. Last updated by anonymous at December 25 2018, 2:18:47 AM. (outdated)
      • Took 6 sec. Last updated by anonymous at December 25 2018, 2:18:50 AM. (outdated)

Oracle JDK8

  • ./bin/zeppelin.sh 37.64s user 25.73s system 29% cpu 3:38.49 total (started and stopped the script manually)
  • First paragraph:
      • Took 54 sec. Last updated by anonymous at December 25 2018, 2:12:16 AM.
  • Each paragraph thereafter (columns from left to right):
      • Took 43 sec. Last updated by anonymous at December 25 2018, 2:12:24 AM. (outdated)
      • Took 13 sec. Last updated by anonymous at December 25 2018, 2:12:29 AM. (outdated)
      • Took 6 sec. Last updated by anonymous at December 25 2018, 2:12:31 AM. (outdated)

My observation is that the performance differences were marginal, although the results would likely vary between the two for different kinds of operations. Hence, more observations are needed. For now, I will stay on the GraalVM JRE, unless otherwise indicated, to see more such variations as we go along.
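To put a rough number on "marginal", the per-paragraph timings reported above can simply be added up. This is only arithmetic on the figures from a single run, not a benchmark, and the paragraph timestamps suggest the sums are a rough aggregate rather than wall-clock time. The sum_seconds helper is hypothetical:

```shell
#!/bin/sh
# sum_seconds: add up a list of whole-second timings.
sum_seconds() {
  total=0
  for n in "$@"; do
    total=$((total + n))
  done
  echo "$total"
}

# Paragraph timings as reported by Zeppelin in the two runs above.
graalvm_total=$(sum_seconds 47 44 10 6)
oracle_total=$(sum_seconds 54 43 13 6)

echo "GraalVM JRE: ${graalvm_total}s"                        # 107s
echo "Oracle JDK8: ${oracle_total}s"                         # 116s
echo "Difference:  $((oracle_total - graalvm_total))s"       # 9s
```

A gap of this size over one run is well within what warm-up and caching effects could explain, which is consistent with needing more observations.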

Note: in Zeppelin lingo, paragraphs are code blocks, and a note is what a notebook is called. So, to give an idea, a note has one or more paragraphs.

There are many other tutorial (sample) notes to play with. Check out the home page under Zeppelin Tutorial.

How to Import a Note

From the home page (http://localhost:8080/#/), we can select the hyperlinked text Import Note, which allows us to import a note (Zeppelin lingo for a notebook) from disk or from a URL.

In our case, I added the note from https://github.com/mmatloka/machine-learning-by-example-workshop (ensure the link to the raw contents of the json file is used, i.e. https://raw.githubusercontent.com/mmatloka/machine-learning-by-example-workshop/master/Workshop.json) into Zeppelin and tried running it, but got various errors when running the first couple of paragraphs.
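The "raw contents" requirement is easy to get wrong when copying a link from the GitHub file view, so here is a small sketch that rewrites a GitHub "blob" URL into its raw form. Both helpers are hypothetical. The import_note function additionally assumes that Zeppelin's Notebook REST API exposes POST /api/notebook/import on the default port and that curl is available, and it is defined but not called here:

```shell
#!/bin/sh
# to_raw_url: turn https://github.com/USER/REPO/blob/BRANCH/PATH
# into https://raw.githubusercontent.com/USER/REPO/BRANCH/PATH.
to_raw_url() {
  printf '%s\n' "$1" |
    sed -E 's#^https://github\.com/([^/]+)/([^/]+)/blob/(.+)$#https://raw.githubusercontent.com/\1/\2/\3#'
}

# import_note RAW_URL [ZEPPELIN_URL]
# Downloads the note JSON and posts it to a running Zeppelin (assumed endpoint).
import_note() {
  raw_url=$1
  zeppelin_url=${2:-http://localhost:8080}
  curl -fsSL "$raw_url" |
    curl -fsS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "$zeppelin_url/api/notebook/import"
}

to_raw_url "https://github.com/mmatloka/machine-learning-by-example-workshop/blob/master/Workshop.json"
```

With Zeppelin running, something like import_note "$(to_raw_url <blob-url>)" would be the scripted equivalent of the Import Note dialog, under the assumptions above.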

Looking for answers as to why I was getting those errors, I came across a forum where similar error messages had been reported, and I took up a suggestion from someone there. It was a workaround for this issue: https://issues.apache.org/jira/browse/ZEPPELIN-3586.

We Failed the Previous Time, So Let’s Try Again…

One of the solutions was to make SPARK_HOME point to a separate instance of Spark and not rely on the embedded Spark interpreter inside the Apache Zeppelin installation. As a workaround, a link to a Dockerfile gist was provided at https://gist.github.com/conker84/4ffc9a2f0125c808b4dfcf3b7d70b043#file-zeppelin-dockerfile. I extended the script to incorporate the GraalVM JRE and added the necessary configuration for it to be visible to Zeppelin and Spark:

Zeppelin-Dockerfile

FROM apache/zeppelin:0.8.0

# Workaround to "fix" https://issues.apache.org/jira/browse/ZEPPELIN-3586
RUN echo "$LOG_TAG Download Spark binary" && \
    wget -O /tmp/spark-2.3.1-bin-hadoop2.7.tgz http://apache.panu.it/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz && \
    tar -zxvf /tmp/spark-2.3.1-bin-hadoop2.7.tgz && \
    rm -rf /tmp/spark-2.3.1-bin-hadoop2.7.tgz && \
    mv spark-2.3.1-bin-hadoop2.7 /spark-2.3.1-bin-hadoop2.7
ENV SPARK_HOME=/spark-2.3.1-bin-hadoop2.7

### My modified steps here on:
RUN rm -fr /usr/lib/jvm/java-1.8.0-openjdk-amd64 /usr/lib/jvm/java-8-openjdk-amd64
RUN wget https://github.com/oracle/graal/releases/download/vm-1.0.0-rc10/graalvm-ce-1.0.0-rc10-linux-amd64.tar.gz
RUN tar xvzf graalvm-ce-1.0.0-rc10-linux-amd64.tar.gz
RUN mv graalvm-ce-1.0.0-rc10/jre /usr/lib/jvm/graalvm-ce-1.0.0-rc10
ENV JAVA_HOME=/usr/lib/jvm/graalvm-ce-1.0.0-rc10
ENV PATH=$JAVA_HOME/bin:$PATH
RUN java -version
RUN rm graalvm-ce-1.0.0-rc10-linux-amd64.tar.gz
RUN rm -fr graalvm-ce-1.0.0-rc10
CMD ["bin/zeppelin.sh"]
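Outside Docker, the same idea of pointing SPARK_HOME at a standalone Spark can be sketched by adding an export to conf/zeppelin-env.sh, the file Zeppelin reads its environment settings from. This is a minimal sketch, not what the post itself did. The set_spark_home helper is hypothetical, and the paths in the usage comment are assumptions carried over from earlier steps:

```shell
#!/bin/sh
# set_spark_home ZEPPELIN_HOME SPARK_HOME
# Appends "export SPARK_HOME=..." to conf/zeppelin-env.sh unless one is already there.
set_spark_home() {
  zeppelin_home=$1
  spark_home=$2
  env_file="$zeppelin_home/conf/zeppelin-env.sh"
  mkdir -p "$zeppelin_home/conf"
  touch "$env_file"
  if grep -q '^export SPARK_HOME=' "$env_file"; then
    echo "SPARK_HOME already set in $env_file; leaving it untouched"
  else
    printf 'export SPARK_HOME=%s\n' "$spark_home" >> "$env_file"
    echo "added SPARK_HOME=$spark_home to $env_file"
  fi
}

# Usage (assumed paths; adjust to where Zeppelin and Spark are unpacked):
#   set_spark_home /opt/zeppelin-0.8.0-bin-netinst /spark-2.3.1-bin-hadoop2.7
```

Zeppelin needs a restart afterwards for the new environment to take effect.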


And then, I created two small bash scripts to help build the Docker image and run the container from the image.

Build the Docker Image

docker build -t zeppelin -f Zeppelin-Dockerfile .


Run the Docker Container

docker run --rm \
    -it \
    -p 8080:8080 zeppelin


Note: the Docker image is called zeppelin:latest, and it is about 4.45GB in size. The above scripts can be found here; please feel free to improve them and create pull requests back into the repo.

In case you don’t wish to do the above, you could try using https://github.com/dylanmei/docker-zeppelin. Apache Zeppelin works out of the box using this container as well.

I wasn’t too keen on the above process because it took more than 45 minutes, 35 of which went into downloading several MBs of Spark. Downloading the GraalVM JDK was a breeze, taking less than five minutes on my high-speed DSL connection.

When the same steps above were applied to load Michal Matloka’s Workshop notebook (workshop.json) and we ran the paragraphs in the notebook, it worked like a charm, without any errors, of course. Thanks, Michal Matloka, for providing such an example to play with and learn multiple things in one go.

The notebook goes all the way from loading the dataset from a .csv file to producing the final outcome, via the parameter avgMetrics (average cross-validation metrics for each paramMap in CrossValidator.estimatorParamMaps, in the respective order).

A score of 53.18 percent might still need a bit of tweaking and fine-tuning to improve, but that is a different discussion and a tangent from our current topic of Zeppelin notes.

Caveat

Somehow, Zeppelin does not like code layouts with such indentations:

val indexToString = new IndexToString()
    .setInputCol("prediction").setOutputCol("predictionLabel")
    .setLabels(stringIndexer.labels)


So, when I removed the indentation to join the chain of function calls together:

val indexToString = new IndexToString().setInputCol("prediction").setOutputCol("predictionLabel").setLabels(stringIndexer.labels)


I was able to run the paragraphs fine. However, I had to do this in all the paragraphs to prevent errors from Zeppelin. Otherwise, you get messages like this across all the paragraphs:

:1: error: illegal start of definition
.setInputCol("prediction").setOutputCol("predictionLabel")
^
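Joining the chained calls by hand across every paragraph gets tedious, so here is a rough sketch of automating it with awk. It appends any line that starts with a dot (after optional indentation) to the line before it. The join_chained_calls helper is hypothetical and works on plain Scala text, for instance paragraph code saved to a file, not on the note's JSON directly:

```shell
#!/bin/sh
# join_chained_calls: read code on stdin and append every line that begins
# with "." (after optional spaces or tabs) to the preceding line.
join_chained_calls() {
  awk '
    /^[ \t]*\./ && have_prev {
      sub(/^[ \t]*/, "")      # drop the leading indentation
      prev = prev $0          # glue the call onto the previous line
      next
    }
    {
      if (have_prev) print prev
      prev = $0
      have_prev = 1
    }
    END { if (have_prev) print prev }
  '
}

# Example: the snippet from this section, joined into a single line.
printf '%s\n' \
  'val indexToString = new IndexToString()' \
  '    .setInputCol("prediction").setOutputCol("predictionLabel")' \
  '    .setLabels(stringIndexer.labels)' |
  join_chained_calls
```

The output of the example is the single-line form shown above, which ran without errors in Zeppelin.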


Summary

Things I like about Zeppelin include:

  • You have a clean and intuitive interface (must be Angular at work)
  • You can write custom interpreters and expand the accepted list of languages
  • You can write your own visualizers
  • Execution progress of every paragraph is displayed in real-time
  • The execution time of every paragraph is computed and displayed in real-time
  • Wherever applicable, a table of data can be turned into a number of visualizations and back into a table of data, and all of this is done lazily (only executed when selected, keeping the results static)

That said, execution can appear to be slower than in JuPyteR notebooks, and a number of bells and whistles available in IPython notebooks are absent. Being an open-source project, this also leaves a lot of room for improvement via contributions: pick your favorite feature for a pull request.

All in all, Zeppelin is a great place for Java/JVM developers to feel at home and do their prototyping, ML training, and experimentation work, especially for developers familiar with not just Python and R but also Java and Scala.

Please keep an eye on this space, and share your comments, feedback, or any contributions that will help us all learn and grow with @theNeomatrix369.

Topics:
apache zeppelin, jupyter notebooks, notebooks, docker, graalvm, python, java, scala

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
