Exploring NLP concepts using Apache OpenNLP inside a Java-enabled Jupyter notebook

In this article, we explore NLP concepts with Apache OpenNLP, using a Java-enabled Jupyter Notebook.

· Big Data Zone ·

Introduction

I have been exploring and playing around with the Apache OpenNLP library after a bit of convincing. For those who are not aware of it, it’s a project of the Apache Software Foundation, which has supported F/OSS Java projects for the last two decades or so. I found its command line interface pretty simple to use, and it is a great learning tool for beginning to work with Natural Language Processing (NLP).

To preface this article, make sure that you're familiar with Jupyter Notebooks. If you are not, have a look at this video or these articles: [1] or [2]. For using the CLI, I’ll refer you to this post.

Exploring NLP Using Apache OpenNLP

Jupyter Notebook: Getting started

Do the following before proceeding any further:

$ git clone git@github.com:neomatrix369/nlp-java-jvm-example.git

or

$ git clone https://github.com/neomatrix369/nlp-java-jvm-example.git
$ cd nlp-java-jvm-example


Then, see the Getting started section in the Exploring NLP concepts from inside a Java-based Jupyter notebook part of the README before proceeding further.

Also, we have chosen the JDK to be GraalVM by default. You can see this from these lines in the console messages:

<---snipped-->
  JDK_TO_USE=GRAALVM
  openjdk version "11.0.5" 2019-10-15
  OpenJDK Runtime Environment (build 11.0.5+10-jvmci-19.3-b05-LTS)
  OpenJDK 64-Bit GraalVM CE 19.3.0 (build 11.0.5+10-jvmci-19.3-b05-LTS, mixed mode, sharing)
  <---snipped-->
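
If you would rather confirm the JDK from inside a notebook cell than from the console messages, the standard JVM system properties carry the same information. The following is a minimal sketch; the class name `JdkInfo` is made up for illustration and is not part of the repo, and the exact values depend on your installation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of the repo): reports which JVM is running.
class JdkInfo {
    // Collects the standard system properties that identify the JVM.
    static Map<String, String> describe() {
        Map<String, String> info = new LinkedHashMap<>();
        for (String key : new String[] {"java.version", "java.vm.name", "java.vm.vendor"}) {
            info.put(key, System.getProperty(key));
        }
        return info;
    }

    public static void main(String[] args) {
        describe().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

With the default setup described above, the `java.vm.name` value should mention GraalVM, in line with the console messages.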


Note: a Docker image has been provided so that you can run a Docker container with all the tools you need. You will see that a shared folder has been created; it is linked to the volume mounted into your container and maps to a folder on your local machine. So, anything created or downloaded into the shared folder will be available even after you exit your container!

Have a quick read of the main README file to get an idea of how to go about using the docker-runner.sh shell script. Then, take a quick glance at the Usage section of the scripts as well.

Running the Jupyter Notebook Container

See the Running the Jupyter notebook container section in the Exploring NLP concepts from inside a Java-based Jupyter notebook part of the README before proceeding further.

All you need to do is run this command after cloning the repo mentioned in the links above:

$ ./docker-runner.sh --notebookMode --runContainer


Once you have the above running, it will automatically open the Jupyter Notebook interface for you in a browser window. You will have a couple of notebooks to choose from (placed in the shared/notebooks folder on your local machine).

Choosing your notebook

Installing Apache OpenNLP in the container

When inside the container in the notebook mode, you have two approaches to install Apache OpenNLP:

Viewing and Accessing the Shared Folder

See the Viewing and accessing the shared folder section in the Exploring NLP concepts from inside a Java-based Jupyter notebook part of the README before proceeding further.

This is also covered briefly in the Jupyter notebooks in the following section. You can list directory contents with the %system Java cell magic; the same file and folder layout is displayed when you list them from the command prompt.
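
Besides the %system cell magic, a folder can also be listed from a plain Java cell. The sketch below uses only the JDK; the class name `SharedFolderListing` is hypothetical, and the folder you pass in (for example, wherever the shared folder is mounted in your container) is an assumption you will need to adjust.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of the repo): lists the entries of a folder.
class SharedFolderListing {
    // Returns the sorted names of the files and folders directly inside dir.
    static List<String> list(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.map(p -> p.getFileName().toString())
                          .sorted()
                          .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the folder to inspect is passed as the first argument;
        // it defaults to the current working directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        list(dir).forEach(System.out::println);
    }
}
```

The listing is not recursive, so it shows only the top level of the folder, much like a plain `ls`.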

Performing NLP Actions in Jupyter Notebook

While the notebook server is running, you will see a launcher window listing the notebooks and other supporting files; it shows up as soon as the notebook server launches.

Each of the notebooks has a purpose. MyFirstJupyterNLPJavaNotebook.ipynb shows how to write Java in an IPython notebook and perform NLP actions using Java code snippets that invoke the Apache OpenNLP library. (See the docs for more details on the classes and methods, and the Java Docs for more details on the Java API usages.)

The other notebook, MyNextJupyterNLPJavaNotebook.ipynb, runs the same Java code snippets on a remote cloud instance (with the help of the Valohai CLI client) and returns the results in the cells, each with just a single command. Creating a Valohai account is quick and free, and the notebook can be used within the free-tier plan.

We are able to examine the below Java API bindings to the Apache OpenNLP library from inside both the Java-based notebooks:

Exploring the Apache OpenNLP Java APIs 

We are able to do this from inside a notebook by running the IJava Jupyter interpreter, which allows writing Java in a typical notebook. We will be exploring the previously mentioned Java APIs using small snippets of Java code and seeing the results appear in the notebook.

So, go back to your browser and look for the MyFirstJupyterNLPJavaNotebook.ipynb notebook and have a look at it. Try reading and executing each cell and see the responses.
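
To give a feel for the kind of small, self-contained Java snippet that fits in a notebook cell, here is a sketch of one common NLP action, splitting text into sentences. Note that this is a stand-in: it uses the JDK's rule-based `java.text.BreakIterator` rather than the Apache OpenNLP API, so it needs no model download, and the class name `SentenceSplitSketch` is made up for illustration. The notebook itself remains the reference for the actual OpenNLP calls.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in sketch: uses the JDK's rule-based BreakIterator, NOT Apache OpenNLP.
class SentenceSplitSketch {
    // Splits text into sentences at the boundaries BreakIterator reports.
    static List<String> split(String text) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        boundaries.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = boundaries.first();
        int end = boundaries.next();
        while (end != BreakIterator.DONE) {
            // Trim the whitespace that trails each sentence boundary.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            start = end;
            end = boundaries.next();
        }
        return sentences;
    }

    public static void main(String[] args) {
        split("Jupyter runs in the browser. The kernel runs Java code.")
            .forEach(System.out::println);
    }
}
```

A model-based sentence detector is trained on data instead of relying on fixed rules, which is one reason the notebook downloads models before processing text.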

Exploring OpenNLP Java APIs With Remote Cloud Services

We are able to do this from inside a notebook. You can run the IJava Jupyter interpreter, which allows you to write Java in a typical notebook. But, in this notebook, we have taken it further and used the %system Java cell magic and the Valohai CLI magic instead of running the Java code snippets in the various cells like the previous notebook.

In this way, downloading models and processing text using the model does not happen on your local machine, but on a more powerful remote server in the cloud. You are also able to control this process from inside the notebook cells. This is more relevant when the models and the datasets to process are large and your local instance(s) do not have the necessary resources to support long-running NLP processes. I have seen NLP training and evaluations take a long time to finish, so high-spec resources are a must.

Again, go back to your browser and look for the MyNextJupyterNLPJavaNotebook.ipynb notebook and have a go with it. Try reading and executing each cell. All the necessary details are in there including links to the docs and supporting pages.

To get a deeper understanding of how these two notebooks were put together and how they work operationally, please have a look at all the source files.

Closing Jupyter Notebook

Make sure you have saved your notebook before you do this. Switch to the console window from where you ran the docker-runner shell script. Pressing Ctrl-C in the console running the Docker container gives you this:

<---snipped--->
[I 21:13:16.253 NotebookApp] Saving file at /MyFirstJupyterJavaNotebook.ipynb
^C
[I 21:13:22.953 NotebookApp] interrupted
Serving notebooks from local directory: /home/jovyan/work
1 active kernel
The Jupyter Notebook is running at:
http://1e2b8182be38:8888/
Shutdown this notebook server (y/[n])? y
[C 21:14:05.035 NotebookApp] Shutdown confirmed
[I 21:14:05.036 NotebookApp] Shutting down 1 kernel
Nov 15, 2019 9:14:05 PM io.github.spencerpark.jupyter.channels.Loop shutdown
INFO: Loop shutdown.
<--snipped-->
[I 21:14:05.441 NotebookApp] Kernel shutdown: 448d46f0-1bde-461b-be60-e248c6342f69


This shuts down the container, and you are back at your local machine's command prompt. Your notebooks stay preserved in the shared/notebooks folder on your local machine, provided you have been saving them as you changed them.

Other Concepts, Libraries, and Tools

There are other Java/JVM based NLP libraries mentioned in the Resources section below. For brevity, we won’t cover them. The links provided will lead to further information for your own pursuit.

Within the Apache OpenNLP tool itself, we have only covered access through the Java bindings from inside the notebooks; the command line interface is covered in the post referred to in the introduction. In addition, we haven’t gone through all the NLP concepts or features of the tool. Again, for brevity, we have only covered a handful of them. But the documentation and resources on the GitHub repo should help in further exploration.

You can also find out how to build the Docker image by examining the docker-runner script.

Conclusion

This has been a very different experience from most other ways of exploring and learning, and you can see why the whole industry (academia, research, data science, and machine learning) has taken to this approach so readily. We still have limitations, but, with time, even they will be overcome, making our experience a smoother one.

Resources

IJava (Jupyter interpreter)

Jupyhai

Apache OpenNLP
