Applying NLP in Java: All From the Command-Line
Learn more about NLP in Java!
We are all aware of machine learning tools and cloud services that work via the browser and give us an interface for our day-to-day data analysis, model training, evaluation, and other tasks, with varying degrees of efficiency.
But what would you do if you wanted to run these tasks from your local machine or on infrastructure available within your organization? And what if the available resources do not meet the prerequisites for decent end-to-end data science or machine learning work? That's when access to a cloud-provider-agnostic deep learning management environment like Valohai can help. On top of that, we will be using the free tier, which is accessible to anyone.
We will build a Java app, then train and evaluate an NLP model with it, and do all of this from the command-line interface with minimal use of the web interface: an end-to-end process all the way through training, saving, and evaluating the NLP model. And we won't need to worry much about setting up, configuring, or managing any environments.
Our Goals
We will learn to do a bunch of things in this post, specifically covering various levels of abstractions (in no particular order):
- How to build and run an NLP model on the local machine
- How to build and run an NLP model on the cloud
- How to build NLP Java apps that run on the CPU or GPU
- Most examples out there are non-Java-based; Java-based ones are much rarer
- Most examples are CPU-based; far fewer run on GPUs
- How to perform the above depending on the presence or absence of resources, i.e. a GPU
- How to build a CUDA docker container for Java
- How to do all the above all from the command-line
- Via individual commands
- Via shell scripts
What Do We Need and How?
Here’s what we need to be able to get started:
- A Java app that builds and runs on any operating system
- CLI tools that allow connecting to remote cloud services
- Shell scripts and code configuration to manage all of the above
The how part of this task is not hard once we have our goals and requirements clear; we will expand on it in the following sections.
NLP for Java: DL4J
We have all of the code and instructions needed to get started with this post captured for you on GitHub. Below are the required steps to get acquainted with the project.
Quick Startup
To quickly get started, we need to do just these:
- Open an account on https://valohai.com, see https://app.valohai.com/accounts/signup/
- Install Valohai CLI on your local machine
- Clone the repo https://github.com/valohai/dl4j-nlp-cuda-example/
$ git clone https://github.com/valohai/dl4j-nlp-cuda-example/
$ cd dl4j-nlp-cuda-example
- Create a Valohai project using the Valohai CLI tool, and give it a name:
$ vh project create
- Link your Valohai project with the GitHub repo https://github.com/valohai/dl4j-nlp-cuda-example/ on the Repository tab of the Settings page (https://app.valohai.com/p/[your-user-id]/dl4j-nlp-cuda-example/settings/repository/)
$ vh project open
### Go to the Settings page > Repository tab and update the git repo address
### with https://github.com/valohai/dl4j-nlp-cuda-example/
- Update Valohai project with the latest commits from the git repo
$ vh project fetch
Now, you’re ready to start using the power of performing machine learning tasks from the command-line.
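Note: if any of the vh commands above complain about authentication, log in to the platform first. A minimal sketch, assuming the CLI was installed via pip (verify the exact commands with vh --help for your CLI version):
$ pip install valohai-cli
### install the Valohai CLI, if you haven't already
$ vh login
### authenticate against your https://app.valohai.com account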
See the Advanced installation and setup section in the README to find out what to install and configure on your system to run the app and experiments on your local machine or inside a Docker container. This is not necessary for this post at the moment, but you can try it out at a later time.
About valohai.yaml
You will have noticed a valohai.yaml file in the git repo; it contains several steps that you can use. They are listed below by the names we will use when running them:
- build-cpu-gpu-uberjar: build our uber jar (both CPU and GPU versions) on Valohai
- train-cpu-linux: run the NLP training using the CPU-version of uber jar on Valohai
- train-gpu-linux: run the NLP training using the GPU-version of uber jar on Valohai
- evaluate-model-linux: evaluate the trained NLP model from one of the above train-* execution steps
- know-your-gpus: run on any instance to gather GPU/Nvidia-related details about that instance; the same script also runs as part of the other steps above (both the build and run steps)
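Before running any of these steps, it can help to sanity-check the YAML itself. A minimal sketch, assuming the lint command is available in your CLI version (confirm with vh --help):
$ vh lint
### validates the valohai.yaml in the current project directory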
Building a Java App From the Command Line
Assuming you are all set up, we will start by building the Java app on the Valohai platform from the command prompt, which is as simple as running this command:
$ vh exec run build-cpu-gpu-uberjar [--adhoc]
### Run `vh exec run --help` to find out more about this command
Then you will see the execution counter, which is nothing but a number:
<--snipped-->
Success! Execution #1 created. See https://app.valohai.com/p/valohai/dl4j-nlp-cuda-example/execution/016dfef8-3a72-22d4-3d9b-7f992e6ac94d/

Note: use --adhoc only if you have not set up your Valohai project with a git repo, or if you have unsaved commits and want to experiment before being sure of the configuration.
You can watch your execution by:
$ vh watch 1
### the parameter 1 is the counter returned by the
### `vh exec run build-cpu-gpu-uberjar` operation above,
### it is the index to refer to that execution run
You will see either that we are waiting for an instance to be allocated or, once the execution has kicked off, console messages scrolling past. You can follow the same via the web interface as well.
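If you would rather not stream the console output, you can also check on executions asynchronously. A sketch, assuming the usual execution subcommands exist in your CLI version (confirm with vh exec --help):
$ vh exec list
### lists recent executions with their counters and statuses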
Note: Instances are available based on how popular they are and also how much quota you have left on them. If they have been used recently, they are more likely to be available next.

Once the step is completed, you will see that it results in a few artifacts, called outputs in Valohai terminology. We can list them by:
$ vh outputs 1
### Run `vh outputs --help` to find out more about this command

We will need the URLs that look like datum://[....some sha like notation...] for our next steps. You can see we have a log file that has captured the GPU-related information about the running instance. You can download this file by:
$ vh outputs --download . --filter *.logs 1
### Run `vh outputs --help` to find out more about this command
Running the NLP Training Process for CPU/GPU From the Command-Line
We will use the built artifacts, namely the uber jars for the CPU and GPU backends, to run our training process:
### Running the CPU uberjar
$ vh exec run train-cpu-linux --cpu-linux-uberjar=datum://016dff00-43b7-b599-0e85-23a16749146e [--adhoc]
### Running the GPU uberjar
$ vh exec run train-gpu-linux --gpu-linux-uberjar=datum://016dff00-2095-4df7-5d9e-02cb7cd009bb [--adhoc]
### Note: these datum:// links will vary in your case
### Run `vh exec run train-cpu-linux --help` to get more details on its usage
Note: Take a look at the Inputs with Valohai CLI docs to see how to write commands like the above.
We can watch the process if we like, but it can be lengthy, so we can switch to another task.
The above execution runs finish by saving the model into the ${VH_OUTPUTS} folder so that Valohai can archive it. The model names get a suffix to keep track of how they were produced.
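As an illustration, the tail end of a training step might do something like the following; the file name here is hypothetical (the real names carry backend-specific suffixes, as noted above):
### Sketch: copy the trained model into the Valohai outputs folder so it is
### archived as a downloadable artifact (CnnModel-cpu.pb is an illustrative name)
$ cp CnnModel-cpu.pb ${VH_OUTPUTS}/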
At any point during our building, training, or evaluation steps, we can stop an ongoing execution (queued or running) by just doing this:
$ vh stop 3
(Resolved stop to execution stop.)
⌛ Stopping #3...
=> {"message":"Stop signal sent"}
Success! Done.
Downloading the Saved Model Post Successful Training
We can query the outputs of an execution by its counter number and download them using:
$ vh outputs 2
$ vh outputs --download . --filter Cnn*.pb 2

You can also evaluate the downloaded models on your local machine, both those created by the CPU- and GPU-based processes (the respective uber jars). Just pass the name of the downloaded model as a parameter to the runner shell script provided, as sketched below.
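A hypothetical invocation of that kind; the actual script name and arguments are documented in the repo's README:
### Hypothetical usage, for illustration only; see the README for the real script
$ ./runner.sh evaluate ./CnnModel-cpu.pb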
Evaluating the Saved NLP Model From a Previous Training Execution
### Running the CPU uberjar and evaluating the CPU-version of the model
$ vh exec run evaluate-model-linux --uber-jar=datum://016dff00-43b7-b599-0e85-23a16749146e --model=datum://016dff2a-a0d4-3e63-d8da-6a61a96a7ba6 [--adhoc]
### Running the GPU uberjar and evaluating the GPU-version of the model
$ vh exec run evaluate-model-linux --uber-jar=datum://016dff00-2095-4df7-5d9e-02cb7cd009bb --model=datum://016dff2a-a0d4-3e63-d8da-6a61a96a7ba6 [--adhoc]
### Note: these datum:// links will vary in your case
### Run `vh exec run evaluate-model-linux --help` to get more details on its usage
At the end of the model evaluation, we get the following model evaluation metrics and confusion matrix after running a test set on the model:

Note: the source code contains ML- and NLP-related explanations at various stages in the form of inline comments.
Capturing the Environment Information About Nvidia’s GPU and CUDA Drivers
This step is unrelated to the process of building and running a Java app on the cloud and controlling and viewing it remotely using the client tool. However, it is useful to know what kind of system we ran our training on, especially for the GPU aspect of the training:
$ vh exec run know-your-gpus [--adhoc]
### Run `vh exec run --help` to get more details on its usage
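Under the hood, a step like this typically gathers its details with the standard Nvidia tooling. A sketch of the kind of commands involved (the repo's actual script may differ):
$ nvidia-smi
### reports the driver version, GPU model, memory usage, and running processes
$ nvcc --version
### reports the installed CUDA compiler/toolkit version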
Keeping Track of Your Experiments
While writing this post, I ran several experiments. To keep track of the successful versus failed experiments in an efficient manner, I used the version control facilities baked into Valohai's design:
- Filtering for executions
- Searching for specific execution by “token”
- Re-running the successful and failed executions
- Confirming that executions succeeded or failed for the right reasons
- Also, check out the data catalogues and data provenance features on the Valohai platform; my project provides an example (look for the Trace button)
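From the command-line side, listing and inspecting executions supports this workflow. A sketch, assuming these subcommands exist in your CLI version (confirm with vh exec --help):
$ vh exec list
### shows counters, steps, and statuses, separating successes from failures
$ vh exec info 4
### shows the details of execution #4 (a counter taken from the list above)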
Comparing the CPU- and GPU-Based Processes
We could have discussed comparisons between the CPU- and GPU- based processes in terms of these:
- App-building performance
- Model training speed
- Model evaluation accuracy
But we won’t cover these topics in this post, although you have access to the metrics you need for it, in case you wish to investigate further.
Necessary Configuration File(s) and Shell Scripts
All the necessary scripts can be found on the GitHub repo in:
- The root folder of the project
- The Docker folder
- The resources-archive folder
Please also have a look at the README.md file for further details on their usage and other information that we haven't mentioned in this post.
Valohai — Orchestration
As you may have noticed, all the above tasks were orchestrated via a few tools at different levels of abstraction:
- Docker to manage infrastructure and platform-level configuration and version control management
- Java to be able to run our apps on any platform of choice
- Shell scripts to, again, run both build and execution commands in a platform-agnostic manner, and to make exceptions for absent resources, i.e. no GPU on macOS
- A client tool, i.e. the Valohai CLI, to connect to the remote cloud service, view and control executions, and download the end results
You are orchestrating your tasks from a single point, making use of the available tools and technologies to perform various data and machine learning tasks.
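To make the Docker part concrete, here is a sketch of building and smoke-testing the CUDA-enabled image from the repo's Docker folder. The image name and tag are illustrative (the published image and its v0.5 tag are linked in the Resources section), and running with GPUs assumes the host has Nvidia drivers and the Nvidia container toolkit installed:
### Build the CUDA-enabled Java image from the repo's Docker folder (path may differ)
$ docker build -t dl4j-nlp-cuda:v0.5 docker/
### Smoke-test GPU visibility inside the container
$ docker run --rm --gpus all dl4j-nlp-cuda:v0.5 nvidia-smi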
Conclusion
We have seen that NLP is a resource-consuming task, and having the right methods and tools at hand certainly helps. Once again, the DeepLearning4J library from Skymind and the Valohai platform have come to our aid. Thanks to the creators of both platforms! Below, we can see the benefits (and more) this post provides.
Highlights
We gained a lot from the above experiments, including:
- Not having to worry about hardware and/or software configuration and version-control management (Docker containers FTW!)
- Being able to run one-off manual building, training, and evaluation tasks (thanks, Valohai CLI tool!)
- Automating regularly used tasks so your team can run them on remote cloud infrastructure (thanks, infrastructure-as-code!)
- Overcoming the limitations of an old or slow machine, or a Mac with no access to an onboard GPU, using the CUDA-enabled Docker image scripts
- Overcoming situations where not enough resources are available on the local or server infrastructure, while still running experiments that require high-throughput, performant environments, on a cloud-provider-agnostic platform (Valohai environments)
- Running tasks without having to wait for them to finish, and running multiple tasks concurrently and in parallel on remote resources in a cost-effective manner (via the Valohai CLI tool)
- Remotely viewing and controlling both configuration and executions, and even downloading the end results after a successful execution (via the Valohai CLI tool)
- And many others you will spot yourself!
Suggestions
- Use the provided CUDA-enabled Docker container: we highly recommend not installing Nvidia drivers, CUDA, or cuDNN directly on your local machine (Linux- or Windows-based); shelve that for later experimentation
- Use the provided shell scripts and configuration files: try not to run CLI commands manually; instead, use shell scripts to automate repeated tasks. The provided examples are a good starting point; take it further from there
- Learn as much as you can about GPUs, CUDA, and cuDNN from the resources provided, and look for more (see the Resources section at the bottom of the post)
- Use version control and infrastructure-as-code systems: git and the valohai.yaml are great examples of this!
I felt very productive and my time and resources were effectively used while doing all of the above. But above all, I can share it with others and everyone can reuse my hard work — just clone the repo and off you go!
What we didn’t cover, and what is potentially a great topic to talk about, is the Valohai Pipelines! Stay tuned for future posts!
Resources
- dl4j-nlp-cuda-example project on GitHub
- CUDA enabled docker container on Docker Hub (use the latest tag: v0.5)
- GPU, Nvidia, CUDA, and cuDNN
- Awesome AI/ML/DL resources
- Java AI/ML/DL resources
- Deep Learning and DL4J Resources
- Awesome AI/ML/DL: NLP resources
- DL4J NLP resources
- Language processing
- ND4J backends for GPUs and CPUs
- How the Vocab Cache Works
- Word2Vec, Doc2vec & GloVe: Neural Word Embeddings for Natural Language Processing
- Doc2Vec, or Paragraph Vectors, in Deeplearning4j
- Sentence iterator
- What is Tokenization?
- Examples
- https://github.com/eclipse/deeplearning4j-examples/tree/master/dl4j-examples
- https://github.com/eclipse/deeplearning4j/tree/master/deeplearning4j/deeplearning4j-nlp-parent
Valohai Resources
- valohai | docs | blogs | GitHub | Videos | Showcase | About valohai | Slack | @valohaiai
- Search for any topic in the Documentation
- Blog posts on how to use the Valohai CLI tool: [1] | [2]
- Custom Docker Images
This blog post was originally published on https://blog.valohai.com.