H2O Word2Vec Tutorial With Example in Scala

Word2Vec is a method of feeding words into machine learning models. In this code-heavy tutorial, learn how to use its algorithm to build such models.


In this Scala example, we will take a given text (from a text file or an array), tokenize it, and train an H2O Word2Vec model on it.

If you would like to know what Word2Vec is and why you should use it, plenty of material is available. You can learn more about the H2O implementation of Word2Vec here, along with its configuration and interpretation.

The full Scala code for this example is available on my GitHub.

Let's start the H2O cluster first:

import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(spark)

Next, import the required libraries:

import scala.io.Source
import _root_.hex.word2vec.{Word2Vec, Word2VecModel}
import _root_.hex.word2vec.Word2VecModel.Word2VecParameters
import water.fvec.Vec

Next, we will create a list of "stop words" (words that are not useful for text mining) so that they can be removed from the word source:

val STOP_WORDS = Set("ourselves", "hers", "between", "yourself", "but", "again", "there", "about", 
    "once", "during", "out", "very", "having", "with", "they", "own", "an", "be", "some", "for", "do", 
    "its", "yours", "such", "into", "of", "most", "itself", "other", "off", "is", "s", "am", "or", "who", "as", 
     "from", "him", "each", "the", "themselves", "until", "below", "are", "we", "these", "your", "his", "through", "don", "nor", "me", "were", "her", 
    "more", "himself", "this", "down", "should", "our", "their", "while", "above", "both", "up", 
    "to", "ours", "had", "she", "all", "no", "when", "at", "any", "before", "them", "same", "and", "been", "have", "in", "will", "on", "does", "yourselves", "then", "that", "because", "what", "over", "why", "so", "can", 
    "did", "not", "now", "under", "he", "you", "herself", "has", "just", "where", "too", "only", "myself", "which", "those", "i", "after", "few", "whom", "t", "being", "if", "theirs", "my", "against", "a", "by", "doing", 
    "it", "how", "further", "was", "here", "than")

Let's ingest the text data. We want to run the Word2Vec algorithm to vectorize the data first and then run a machine learning experiment on it.

I have downloaded a free story, The Adventures of Sherlock Holmes, from the internet, and am using that as my source.

val filename = "/Users/avkashchauhan/Downloads/TheAdventuresOfSherlockHolmes.txt"
val lines = Source.fromFile(filename).getLines.toArray
val sparkframe = sc.parallelize(lines)
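
The snippet above leaves the file handle open. As a minimal sketch in plain Scala (no Spark or H2O needed), the lines could be read with the source closed afterwards. The object and method names here (ReadLinesDemo, readLines) are hypothetical helpers for illustration, and the UTF-8 encoding is an assumption about the input file:

```scala
import scala.io.Source

object ReadLinesDemo {
  // Hypothetical helper: read all lines of a text file and close the source.
  // Assumes the file is UTF-8 encoded.
  def readLines(path: String): Array[String] = {
    val source = Source.fromFile(path, "UTF-8")
    // Materialize the lines before closing, since getLines() is lazy
    try source.getLines().toArray
    finally source.close()
  }
}
```

The resulting array can then be passed to sc.parallelize in the same way as lines above.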

Let's define the tokenize function, which will convert our input text to tokens:

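def tokenize(line: String) = {
  // Split on non-word characters (punctuation, whitespace) rather than just " "
  line.split("""\W+""")
    .map(_.toLowerCase)
    // Drop empty tokens, which split produces for blank lines or leading punctuation
    .filter(_.nonEmpty)
    // Remove the stop words defined above
    .filterNot(word => STOP_WORDS.contains(word)) :+ null // null marks the end of each line
}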
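
To see what a tokenizer of this shape produces, here is a small plain-Scala sketch that runs without Spark or H2O. The reduced stop-word set and the names TokenizeDemo and demoTokenize are hypothetical, and the step that drops empty tokens is an addition made for this illustration:

```scala
object TokenizeDemo {
  // Hypothetical, reduced stop-word set used only for this illustration
  val demoStopWords: Set[String] = Set("the", "a", "of", "to", "is")

  def demoTokenize(line: String): Array[String] =
    line.split("""\W+""")                  // split on runs of non-word characters
      .map(_.toLowerCase)                  // normalize case
      .filter(_.nonEmpty)                  // leading punctuation yields an empty first token
      .filterNot(demoStopWords.contains) :+ null // trailing null marks the end of the line

  // The leading "(" would otherwise produce an empty token
  val tokens: Array[String] = demoTokenize("(The Adventures of Sherlock Holmes)")
}
```

Here tokens holds "adventures", "sherlock", "holmes", followed by the trailing null delimiter.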

Now, call the tokenize function on every line to create a collection of labeled words:

val allLabelledWords = sparkframe.flatMap(d => tokenize(d))

Note: You can also use your own tokenize function or one from a library; you just need to map the function over the data.

Convert the collection of labeled words into an H2O frame:

val h2oFrame = h2oContext.asH2OFrame(allLabelledWords)

It's finally time to use the H2O Word2Vec algorithm. Configure the parameters first:

val w2vParams = new Word2VecParameters
w2vParams._train = h2oFrame._key
w2vParams._epochs = 500
w2vParams._min_word_freq = 0
w2vParams._init_learning_rate = 0.05f
w2vParams._window_size = 20
w2vParams._vec_size = 20
w2vParams._sent_sample_rate = 0.0001f

Now, train the model:

val w2v = new Word2Vec(w2vParams).trainModel().get()

With the model trained, we can put it to use.

Let's start with a first test: finding synonyms with the trained Word2Vec model. The findSynonyms method takes a word and a count N, and returns the top N synonyms along with their distance values:

w2v.findSynonyms("love", 3)
w2v.findSynonyms("help", 2)
w2v.findSynonyms("hate", 1)
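
Synonym lookups over word embeddings are commonly ranked by cosine similarity between word vectors. The following plain-Scala sketch illustrates that idea only; it is not H2O's implementation, and the tiny 3-dimensional vectors and the names CosineDemo, cosine, and nearest are hypothetical:

```scala
object CosineDemo {
  // Cosine similarity: dot product divided by the product of the vector norms
  def cosine(a: Array[Float], b: Array[Float]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    val dot   = a.zip(b).map { case (x, y) => x.toDouble * y }.sum
    val normA = math.sqrt(a.map(x => x.toDouble * x).sum)
    val normB = math.sqrt(b.map(x => x.toDouble * x).sum)
    dot / (normA * normB)
  }

  // Hypothetical toy embeddings; a real model would have learned these
  val vectors: Map[String, Array[Float]] = Map(
    "love"      -> Array(1f, 0f, 0f),
    "affection" -> Array(1f, 1f, 0f),
    "help"      -> Array(0f, 1f, 0f),
    "hate"      -> Array(-1f, 0f, 0f)
  )

  // Return the n words most similar to `word`, highest similarity first
  def nearest(word: String, n: Int): Seq[(String, Double)] = {
    val target = vectors(word)
    (vectors - word).toSeq
      .map { case (w, v) => (w, cosine(target, v)) }
      .sortBy { case (_, score) => -score }
      .take(n)
  }
}
```

With these toy vectors, CosineDemo.nearest("love", 2) ranks "affection" first and "help" second, while "hate" points in the opposite direction and scores lowest.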

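Finally, let's transform words into their vectors using the Word2Vec model. The aggregate method controls how the output is combined: NONE returns one vector per input word, while AVERAGE averages the word vectors within each NA-delimited sequence of words.

The transform() function takes an H2O vector (Vec) as its first parameter, so the vector must be extracted from the H2O frame: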

val wordVectors = w2v.transform(h2oFrame.vec(0), Word2VecModel.AggregateMethod.NONE).toTwoDimTable()
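
For intuition about the averaging style of aggregation, here is a plain-Scala sketch that averages word vectors per sequence, using null as the sequence delimiter in the same way the tokenizer does. It only illustrates the concept and is not H2O's code; the 2-dimensional embeddings and the names AverageDemo, splitOnNull, and average are hypothetical:

```scala
object AverageDemo {
  // Hypothetical toy embeddings
  val embeddings: Map[String, Array[Float]] = Map(
    "sherlock" -> Array(1f, 3f),
    "holmes"   -> Array(3f, 5f),
    "watson"   -> Array(2f, 2f)
  )

  // Split a token stream into sequences, treating null as the delimiter
  def splitOnNull(tokens: Seq[String]): Vector[Vector[String]] = {
    val (done, current) =
      tokens.foldLeft((Vector.empty[Vector[String]], Vector.empty[String])) {
        case ((acc, cur), null) => (acc :+ cur, Vector.empty[String])
        case ((acc, cur), tok)  => (acc, cur :+ tok)
      }
    if (current.nonEmpty) done :+ current else done
  }

  // Element-wise mean of the vectors; None when there is nothing to average
  def average(vs: Seq[Array[Float]]): Option[Array[Float]] =
    if (vs.isEmpty) None
    else Some(Array.tabulate(vs.head.length)(i => vs.map(_(i)).sum / vs.size))

  val tokens: Seq[String] = Seq("sherlock", "holmes", null, "watson", null)

  // One averaged vector per sequence; words without an embedding are skipped
  val sequenceVectors: Vector[Option[Array[Float]]] =
    splitOnNull(tokens).map(seq => average(seq.flatMap(w => embeddings.get(w))))
}
```

In this toy example the first sequence averages to (2.0, 4.0) and the second, which contains a single word, keeps that word's vector (2.0, 2.0).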

And that's it. Enjoy!


Topics:
h2o ,machine learning ,scala ,ai ,tutorial ,algorithm

Published at DZone with permission of
