5 Minute ML: Word Embedding
Let's take a look at word embedding, briefly explore one-hot encoding, and see how to use a pre-trained model.
Natural Language Processing has gathered quite a lot of attention in recent times thanks to machine translation, chatbots, Siri, and Alexa. And while plenty of people work with text data, the way we represent and process that text has changed fundamentally over the last five years.
One-Hot Encoding
To map text data into a form an algorithm can process, we used to transform the data into numbers using one-hot encoding. The idea is simple: take a vocabulary of size N and map it to a set of N vectors of 1s and 0s, each of size N. Every vector represents one word; the i-th word has a 1 at the i-th position of its vector and 0s everywhere else. For example, if our vocabulary consists of the four words “Computer,” “Machine,” “Learning,” and “Language,” then we represent this vocabulary using four vectors of size four: [1 0 0 0], [0 1 0 0], [0 0 1 0], and [0 0 0 1]. Five years ago, this was pretty much how we processed text data before applying any algorithm.
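To make this concrete, here is a minimal sketch of one-hot encoding in Python with NumPy, using the four-word vocabulary from the example above (the word order is arbitrary):
import numpy as np

# Example vocabulary; the order determines which position holds the 1
vocabulary = ["Computer", "Machine", "Learning", "Language"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's index
    vector = np.zeros(len(vocabulary), dtype=int)
    vector[word_to_index[word]] = 1
    return vector

for word in vocabulary:
    print(word, one_hot(word))
# Computer [1 0 0 0]
# Machine [0 1 0 0]
# Learning [0 0 1 0]
# Language [0 0 0 1]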
Word Embedding
One major shortcoming of the one-hot encoding technique was that every word had a completely separate identity; there was no way to capture correlations, similarities, or relationships between words. Around a decade ago, Yoshua Bengio proposed learning dense vector representations of words as part of a neural language model. But the representation really caught the fancy of the machine learning world when Tomas Mikolov published his work on Word2Vec in 2013.
Word embedding is based on the premise that words occurring in the same contexts tend to have similar meanings, so ideally we would like similar words to have similar vectors. Word2Vec not only produced similar vectors for similar words, it also showed that simple arithmetic can be carried out on such vectors; the classic example is king – man + woman ≈ queen. Since then, GloVe, created by the Stanford NLP Group, and FastText, from Facebook, have also been used extensively for text processing. Pre-trained word embeddings are freely available from all three sources, and they can be downloaded and used for transfer learning if your text comes from an open domain such as a news corpus or Twitter data.
Using a Pre-Trained Model
from gensim.models import KeyedVectors

# Load the pre-trained Google News vectors
# (download GoogleNews-vectors-negative300.bin beforehand)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Get the vector of any word in the Google News corpus
vector = model['London']
print(vector)

# Get words similar to a given word
print("similar: ", model.most_similar('London'))
Creating Your Own Word Embedding
from gensim.models import Word2Vec

# Two toy "documents", already tokenized into lists of words
sentences = [['London', 'is', 'the', 'capital', 'of', 'England'], ['Paris', 'is', 'the', 'capital', 'of', 'France']]

# Train word2vec on the two sentences
model = Word2Vec(sentences, min_count=1)

# Look up the learned vector through the model's word vectors (model.wv)
print(model.wv['capital'])
print(model.wv['capital'].shape)
The code to create your own word2vec model can be as simple as the few lines above. Of course, for your own dataset you need to read the data, clean it up, tokenize it, and then store it as a list of lists, as shown above in the variable sentences. We will discuss the preprocessing step in more detail in upcoming articles.
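As a rough preview, here is a minimal sketch of that step, assuming simple lowercasing, punctuation stripping, and whitespace tokenization (real-world preprocessing usually needs more care):
import string

raw_documents = ["London is the capital of England.", "Paris is the capital of France."]

# Lowercase, strip punctuation, and split on whitespace to get a list of token lists
sentences = [doc.lower().translate(str.maketrans('', '', string.punctuation)).split()
             for doc in raw_documents]
print(sentences)
# [['london', 'is', 'the', 'capital', 'of', 'england'], ['paris', 'is', 'the', 'capital', 'of', 'france']]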