Exploring NLP concepts using Apache OpenNLP inside a Java-enabled Jupyter notebook
In this article, we explore NLP concepts with Apache OpenNLP, using a Java-enabled Jupyter Notebook.
I have been exploring and playing around with the Apache OpenNLP library after a bit of convincing. For those who are not aware of it, it is a project of the Apache Software Foundation, which has supported F/OSS Java projects for the last two decades or so. I found its command-line interface pretty simple to use, and it is a great learning tool for anyone beginning to work with Natural Language Processing (NLP).
Exploring NLP Using Apache OpenNLP
Jupyter Notebook: Getting started
Do the following before proceeding any further:
$ git clone git@github.com:neomatrix369/nlp-java-jvm-example.git
or
$ git clone https://github.com/neomatrix369/nlp-java-jvm-example.git
$ cd nlp-java-jvm-example
Then, see the Getting started section in the Exploring NLP concepts from inside a Java-based Jupyter notebook part of the README before proceeding further.
Also, we have chosen the JDK to be GraalVM by default. You can see this from these lines in the console messages:
<---snipped-->
JDK_TO_USE=GRAALVM
openjdk version "11.0.5" 2019-10-15
OpenJDK Runtime Environment (build 11.0.5+10-jvmci-19.3-b05-LTS)
OpenJDK 64-Bit GraalVM CE 19.3.0 (build 11.0.5+10-jvmci-19.3-b05-LTS, mixed mode, sharing)
<---snipped-->
Note: a Docker image is provided so that you can run a container with all the tools you need. A shared folder is created on your local machine and mounted as a volume in the container. So, anything created or downloaded into the shared folder will be available even after you exit your container!
Running the Jupyter Notebook Container
See Running the Jupyter notebook container section in the Exploring NLP concepts from inside a Java-based Jupyter notebook part of the README before proceeding further.
All you need to do is run this command after cloning the repo mentioned in the links above:
$ ./docker-runner.sh --notebookMode --runContainer
Once the above is running, it automatically opens the Jupyter Notebook interface in a browser window. You will have a couple of notebooks to choose from (placed in the shared/notebooks folder on your local machine).
Installing Apache OpenNLP in the container
When inside the container in the notebook mode, you have two approaches to install Apache OpenNLP:
- From the command-line interface (optional): See the From the command-line interface sub-section under the Installing Apache OpenNLP in the container section in the Exploring NLP concepts from inside a Java-based Jupyter notebook part of the README before proceeding further.
- From inside the Jupyter notebook (recommended): See From inside the Jupyter notebook sub-section under the Installing Apache OpenNLP in the container section in the Exploring NLP concepts from inside a Java-based Jupyter notebook part of the README before proceeding further.
Viewing and Accessing the Shared Folder
See the Viewing and accessing the shared folder section in the Exploring NLP concepts from inside a Java-based Jupyter notebook part of the README before proceeding further.
This is also covered briefly in the Jupyter notebooks in the following section. You can list directory contents with the %system Java cell magic, and the command prompt shows a similar layout of files and folders.
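The directory check described above can also be done in plain Java from a notebook cell. The following is a minimal sketch that uses only the JDK; the class name SharedFolderListing and its method are hypothetical names chosen for illustration, and this is not how the %system cell magic is implemented. The folder to list is passed as an argument because the location of the shared folder inside your container may differ from what you expect.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of the repo): lists the entries of a folder.
public class SharedFolderListing {

    // Returns the names of the entries directly inside the folder, sorted alphabetically.
    public static List<String> listEntries(Path folder) {
        // Files.list keeps a directory handle open, so close the stream when done.
        try (Stream<Path> entries = Files.list(folder)) {
            return entries
                    .map(p -> p.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the folder is given as the first argument;
        // the current working directory is used when none is given.
        Path folder = Paths.get(args.length > 0 ? args[0] : ".");
        if (!Files.isDirectory(folder)) {
            System.out.println("Not a directory: " + folder.toAbsolutePath());
            return;
        }
        listEntries(folder).forEach(System.out::println);
    }
}
```

In a notebook cell, you could define the class once and then call SharedFolderListing.listEntries(Paths.get(".")) in a later cell to compare the result with what the command prompt shows.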
Performing NLP Actions in Jupyter Notebook
While you have the notebook server running, you will see this launcher window with a list of notebooks and other supporting files that show up as soon as the notebook server launches:
Each of the notebooks above has a purpose. MyFirstJupyterNLPJavaNotebook.ipynb shows how to write Java in an IPython notebook and perform NLP actions using Java code snippets that invoke the Apache OpenNLP library. (See the docs for more details on the classes and methods and the Java Docs for more details on the Java API usages).
The other notebook, MyNextJupyterNLPJavaNotebook.ipynb, runs the same Java code snippets on a remote cloud instance (with the help of the Valohai CLI client) and returns the results in the cells, each with a single command. It’s fast and free to create an account and use it within the free-tier plan.
We can examine the following Java API bindings to the Apache OpenNLP library from inside both of the Java-based notebooks:
- Language Detector API.
- Sentence Detection API.
- Tokenizer API.
- Name Finder API (including other examples).
- Parts of speech (POS) Tagger API.
- Chunking API.
- Parsing API.
Exploring the Apache OpenNLP Java APIs
We are able to do this from inside a notebook by running the IJava Jupyter interpreter, which allows writing Java in a typical notebook. We will be exploring the previously mentioned Java APIs using small snippets of Java code and see the results appear in the notebook:
So, go back to your browser and look for the MyFirstJupyterNLPJavaNotebook.ipynb notebook and have a look at it. Try reading and executing each cell and see the responses.
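If you want a feel for what sentence detection and tokenization produce before opening the notebook, here is a rough, JDK-only sketch. It deliberately does not use the Apache OpenNLP API: it swaps in java.text.BreakIterator, a rule-based class from the standard library, so it runs without any model download. The class name TextSplittingSketch is hypothetical, and the OpenNLP detectors shown in the notebook can split text differently.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration: a JDK stand-in for the idea behind sentence
// detection and tokenization. It is NOT the Apache OpenNLP API.
public class TextSplittingSketch {

    // Collects the non-blank pieces of text between consecutive boundaries.
    private static List<String> split(BreakIterator boundaries, String text) {
        List<String> parts = new ArrayList<>();
        boundaries.setText(text);
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String part = text.substring(start, end).trim();
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    // Splits text into sentences using the JDK's English sentence rules.
    public static List<String> sentences(String text) {
        return split(BreakIterator.getSentenceInstance(Locale.ENGLISH), text);
    }

    // Splits text into word and punctuation tokens, dropping whitespace.
    public static List<String> tokens(String text) {
        return split(BreakIterator.getWordInstance(Locale.ENGLISH), text);
    }

    public static void main(String[] args) {
        String text = "Jupyter runs Java cells. OpenNLP processes the text.";
        for (String sentence : sentences(text)) {
            System.out.println(sentence + " -> " + tokens(sentence));
        }
    }
}
```

Comparing this output with what the notebook cells return for the same input is a simple way to see where rule-based splitting and OpenNLP's detectors differ.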
Exploring OpenNLP Java APIs With Remote Cloud Services
We are able to do this from inside a notebook. You can run the IJava Jupyter interpreter, which allows you to write Java in a typical notebook. But, in this notebook, we have taken it further and used the %system Java cell magic and the Valohai CLI magic instead of running the Java code snippets in the various cells like the previous notebook.
In this way, downloading models and processing text using the model does not happen on your local machine, but on a more capable remote server in the cloud. You are also able to control this process from inside the notebook cells. This is more relevant when the models and the datasets to process are large and your local instance(s) do not have the necessary resources to support long-running NLP processes. I have seen NLP training and evaluation runs take a long time to finish, so high-spec resources are a must.
Again, go back to your browser and look for the MyNextJupyterNLPJavaNotebook.ipynb notebook and have a go with it. Try reading and executing each cell. All the necessary details are in there including links to the docs and supporting pages.
To get a deeper understanding of how these two notebooks were put together and how they work operationally, please have a look at all the source files.
Closing Jupyter Notebook
Make sure you have saved your notebook before you do this. Switch to the console window from where you ran the docker-runner shell script. Pressing Ctrl-C in the console running the Docker container gives you this:
<---snipped--->
[I 21:13:16.253 NotebookApp] Saving file at /MyFirstJupyterJavaNotebook.ipynb
^C [I 21:13:22.953 NotebookApp] interrupted
Serving notebooks from local directory: /home/jovyan/work
1 active kernel
The Jupyter Notebook is running at:
http://1e2b8182be38:8888/
Shutdown this notebook server (y/[n])? y
[C 21:14:05.035 NotebookApp] Shutdown confirmed
[I 21:14:05.036 NotebookApp] Shutting down 1 kernel
Nov 15, 2019 9:14:05 PM io.github.spencerpark.jupyter.channels.Loop shutdown
INFO: Loop shutdown.
<--snipped-->
[I 21:14:05.441 NotebookApp] Kernel shutdown: 448d46f0-1bde-461b-be60-e248c6342f69
This shuts down the container, and you are back at your local machine's command prompt. Your notebooks are preserved in the shared/notebooks folder on your local machine, provided you saved them as you changed them.
Other Concepts, Libraries, and Tools
There are other Java/JVM based NLP libraries mentioned in the Resources section below. For brevity, we won’t cover them. The links provided will lead to further information for your own pursuit.
Within the Apache OpenNLP tool itself, we have only covered a handful of the Java API bindings from inside the notebooks. We haven’t gone through all the NLP concepts or features of the tool; again, this is for brevity. The documentation and the resources on the GitHub repo should help with further exploration.
This has been a very different experience from most other ways of exploring and learning, and you can see why the whole industry (Academia, Research, Data Science, and Machine Learning) has taken to this approach by storm. We still have limitations, but, with time, even they will be overcome, making our experience a smooth one.
IJava (Jupyter interpreter)
- %system Java cell magic implementation.
- Docker image with IJava + Jupyhai + other dependencies.
- Version Control for Jupyter Notebooks.
- Valohai’s Jupyter Notebook Extension.
- Automatic Version Control Meets Jupyter Notebooks.
- Run Jupyter Notebook On Any Cloud Provider.
- nlp-java-jvm-example GitHub project.
- Models page.
- Language Detect model.
- Older models to support the examples in the docs.