5 Minute ML: Word Embedding

Let's take a look at word embedding, along with a brief tour of one-hot encoding and how to use a pre-trained model.

By Shrikant Kashyap · Oct. 20, 18 · Tutorial

Natural Language Processing has gathered quite a lot of attention in recent times thanks to machine translation, chatbots, Siri, and Alexa. While plenty of people work with text data, the fundamental change of the last five years is in how we represent that text before handing it to an algorithm.

One-Hot Encoding

To map text into a form an algorithm can process, we used to turn the data into numbers with one-hot encoding. The idea is simple: take a vocabulary of size N and map it to a set of N vectors of 1s and 0s, each of length N, with one vector per word. The ith word has a 1 at the ith position of its vector and 0s everywhere else. For example, if our vocabulary consists of the four words “Computer,” “Machine,” “Learning,” and “Language,” we represent them with four vectors of size 4: [1000], [0100], [0010], and [0001]. Five years back, this was pretty much how text data was prepared before applying any algorithm.
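
Here is a minimal sketch of that idea in plain Python, using the four-word vocabulary above (the one_hot helper is just illustrative, not a library function):

vocabulary = ["Computer", "Machine", "Learning", "Language"]

def one_hot(word, vocabulary):
    # a vector of zeros with a single 1 at the word's index
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

for word in vocabulary:
    print(word, one_hot(word, vocabulary))
# Computer [1, 0, 0, 0]
# Machine [0, 1, 0, 0]
# Learning [0, 0, 1, 0]
# Language [0, 0, 0, 1]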

Word Embedding

One major shortcoming of one-hot encoding was that every word had a completely separate identity: there was no way to capture any correlation, similarity, or relationship between words. Around a decade back, Yoshua Bengio proposed representing words as dense, learned vectors, but the idea really caught the fancy of the machine learning world when Tomas Mikolov published his work on Word2Vec in 2013.

Word embedding is based on the premise that words occurring in the same contexts tend to carry similar meanings, so we would ideally like similar words to have similar vectors. Word2Vec not only produced similar vectors for similar words, it also showed that simple arithmetic can be carried out on those vectors; a good example is king – man + woman ≈ queen (demonstrated in code below). Since then, GloVe from the Stanford NLP Group and fastText from Facebook have also been used extensively for text processing. Pre-trained word embeddings are freely available from all three sources and can be downloaded and used for transfer learning when your data is open-domain text such as a news corpus or Twitter data.

Using a Pre-Trained Model

import gensim
from gensim.models import KeyedVectors

# load the pre-trained Google News vectors (the .bin file must be downloaded separately)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# get the vector of any word in the Google News corpus
vector = model['London']
print(vector)

# get words similar to a given word
print("similar: ", model.most_similar('London'))

Create Your Own Word Embedding

import gensim
from gensim.models import Word2Vec

sentences = [['London', 'is', 'the', 'capital', 'of', 'England'], ['Paris', 'is', 'the', 'capital', 'of', 'France']]

# train Word2Vec on the two sentences
model = Word2Vec(sentences, min_count=1)

# look up the learned vector for a word (use model.wv in recent gensim versions)
print(model.wv['capital'])
print(model.wv['capital'].shape)

The code to create your own word2vec model can be as simple as the few lines above. Of course, for your own dataset you need to read the data, clean it up, tokenize it, and store it as a list of lists, as shown in the sentences variable above. We will discuss the preprocessing part in upcoming articles.
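
As a rough sketch of that preprocessing step (the cleaning rules here are purely illustrative, not a prescription), turning raw text into the list-of-lists format could look like this:

import re

raw_text = "London is the capital of England. Paris is the capital of France."

# split into sentences, lowercase, keep only word characters, and tokenize
sentences = []
for sent in raw_text.split('.'):
    tokens = re.findall(r"[a-z']+", sent.lower())
    if tokens:
        sentences.append(tokens)

print(sentences)
# [['london', 'is', 'the', 'capital', 'of', 'england'], ['paris', 'is', 'the', 'capital', 'of', 'france']]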


Opinions expressed by DZone contributors are their own.
