This Week in Hadoop and More: NLP and DL

Natural language processing and various deep learning libraries, combined with Spark.


Many of the interesting new libraries coming out are in Python, so I suggest you install at least Python 2.7 (or Python 3) and have pip available to install cool parsing, ML, and DL libraries.

 pip install -U spacy 

One warning: a lot of deep learning and NLP libraries include large training datasets that can take up gigabytes or more of space on your hard drive.


One excellent use of NLP is to identify names in a corpus of text, such as a large set of tweets stored in your data lake or a huge collection of corporate documents. See this article I wrote on the topic.
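As a rough illustration of pulling names out of a corpus, here is a minimal sketch using spaCy's named entity recognizer. The model name `en_core_web_sm`, the helper names, and the sample tweets are all assumptions made for this example rather than anything from the linked article, and the model has to be downloaded separately before `demo()` will run.

```python
from collections import Counter


def load_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline. The model name is an assumption for this sketch."""
    import spacy  # imported lazily so the counting helper works without spaCy
    return spacy.load(model_name)


def entity_pairs(nlp, texts):
    """Yield (entity text, entity label) for every entity found in the texts."""
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            yield ent.text, ent.label_


def count_people(pairs):
    """Count how often each entity labeled PERSON appears."""
    return Counter(text for text, label in pairs if label == "PERSON")


def demo():
    """Hypothetical usage: count the people mentioned in a couple of tweets."""
    tweets = [
        "Ada Lovelace wrote the first program.",
        "Grace Hopper and Ada Lovelace both shaped computing.",
    ]
    nlp = load_pipeline()
    return count_people(entity_pairs(nlp, tweets)).most_common(10)
```

Calling `demo()` returns the most frequently mentioned people. Swapping `"PERSON"` for `"ORG"` or `"GPE"` would count organizations or places instead.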

Quick Sentiment Analysis in a Few Lines of Python

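from nltk.sentiment.vader import SentimentIntensityAnalyzer
import sys

# Requires the VADER lexicon: run nltk.download('vader_lexicon') once beforehand.
sid = SentimentIntensityAnalyzer()
# polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound' keys.
ss = sid.polarity_scores(sys.argv[1])
if ss['compound'] == 0.00:
    print('Neutral')
elif ss['compound'] < 0.00:
    print('Negative')
else:
    print('Positive')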
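Comparing the compound score against exactly 0.00 leaves almost nothing in the neutral bucket. A possible refinement, sketched below, is to treat a small band around zero as neutral and to score several texts at once. The 0.05 cutoff follows the convention suggested in the VADER project's own documentation, and the function names are hypothetical.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


def score_texts(texts, analyzer=None):
    """Return (text, compound, label) for each text.

    With no analyzer passed in, this needs nltk and its vader_lexicon data.
    """
    if analyzer is None:
        from nltk.sentiment.vader import SentimentIntensityAnalyzer
        analyzer = SentimentIntensityAnalyzer()
    results = []
    for text in texts:
        compound = analyzer.polarity_scores(text)["compound"]
        results.append((text, compound, label_sentiment(compound)))
    return results
```

Passing the analyzer in as a parameter means one instance can be reused across a large batch, such as a file of tweets, instead of rebuilding it for every call.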

Deep Learning

Caffe is another great deep learning library, and it has support from Yahoo and others. For using the ever-popular ImageNet, check this out. There is also a web demo of interfacing with Caffe.

There are so many flavours of deep learning that I am hoping Keras will help unify them all. My odds-on favorites are TensorFlow and DeepLearning4J, which I expect to dominate thanks to their ecosystems, communities, backers, quality, mind share, and Keras. It's nice to have Microsoft and Google competing to see who can provide the best open source libraries! I am hoping many of these libraries will move into Apache and get unified under one banner. Imagine all those developers and scientists working on one unified set of frameworks, algorithms, models, and documentation. Skynet in 2 years...

Pretrained model zoos are awesome, but I think Pet Clone Farm sounds better, as pets are pretrained. You can take animals from the zoo, and they are not pretrained.

Speaking of Model Zoos

Must-Watch Presentations to Start Your Year

Using GPUs Within Spark

IBM has a few interesting enhancements to Spark that allow it to use GPU processing power. GPUs are becoming extremely useful for machine learning, deep learning, and number-crunching jobs.

Deep Learning Resources
