
Gensim – Vectorizing Text and Transformations


Let's take a look at what Gensim is, at what vectors are and why we need them, and at the bag-of-words model and TF-IDF.


In this article, we will discuss vector spaces and the open source Python package Gensim. Here, we'll be touching the surface of Gensim's capabilities. This article will introduce the data structures largely used in text analysis involving machine learning techniques — vectors.

Introducing Gensim

When we talk about representations and transformations in this article, we will be exploring different ways of representing our strings as vectors, such as bag-of-words, TF-IDF (term frequency-inverse document frequency), LSI (latent semantic indexing), and the more recently popular word2vec. The transformed vectors can be plugged into scikit-learn machine learning methods just as easily. Gensim started off as a modest project by Radim Rehurek and grew largely out of his Ph.D. thesis, Scalability of Semantic Analysis in Natural Language Processing. It included novel implementations of latent Dirichlet allocation (LDA) and latent semantic analysis among its primary algorithms, as well as TF-IDF and random projection implementations. It has since grown to be one of the largest NLP/information retrieval Python libraries, and it is both memory-efficient and scalable, as opposed to the previous, largely academic code available for semantic modeling (for example, the Stanford Topic Modelling Toolkit).

Gensim manages to be scalable because it uses Python's built-in generators and iterators for streamed data processing, so the dataset is never actually loaded completely into RAM. Most IR algorithms involve matrix decompositions, which in turn involve matrix multiplications. These are performed by NumPy, which is built on FORTRAN/C and highly optimized for mathematical operations. Since all the heavy lifting is passed on to these low-level BLAS libraries, Gensim offers the ease of use of Python with the power of C.

The primary features of Gensim are its memory-independent nature, multi-core implementations of latent semantic analysis, latent Dirichlet allocation, random projections, hierarchical Dirichlet process (HDP), and word2vec deep learning, as well as the ability to use LSA and LDA on a cluster of computers. It also plugs seamlessly into the Python scientific computing ecosystem and can be extended with other vector space algorithms. Gensim's directory of Jupyter Notebooks serves as an important documentation source, with its tutorials covering most of what Gensim has to offer. Jupyter Notebooks are a useful way to run code on a live server — the documentation page is worth having a look at!

The tutorials page can help you get started with Gensim, but the coming sections will also describe how to begin using it, and how important a role vectors will play in the rest of our time exploring machine learning and text processing.

Vectors and Why We Need Them

We're now moving toward the machine learning part of text analysis - this means that we will now start playing a little less with words and a little more with numbers. Even when we used spaCy, the POS tagging and NER tagging, for example, were done through statistical models, but their inner workings were largely hidden from us — we passed in Unicode text and, after some magic, got back annotated text.

For Gensim however, we're expected to pass vectors as inputs to the IR algorithms (such as LDA or LSI), largely because what's going on under the hood is mathematical operations involving matrices. This means that we have to represent what was previously a string as a vector, and these kinds of representations or models are called Vector Space Models.

From a mathematical perspective, a vector is a geometric object which has magnitude and direction. We don't need to pay as much attention to this, and rather think of vectors as a way of projecting words onto a mathematical space while preserving the information provided by these words.

Machine learning algorithms use these vectors to make predictions. We can understand machine learning as a suite of statistical algorithms and the study of these algorithms. The purpose of these algorithms is to learn from the provided data by decreasing the error of their predictions. As such, this is a wide field - we will explain particular machine learning algorithms as and when they come up.

Let us meanwhile discuss a couple of forms of these representations.

Bag-of-Words

The bag-of-words model is arguably the most straightforward form of representing a sentence as a vector. Let's start with an example:

S1: "The dog sat by the mat."

S2: "The cat loves the dog."

If we follow the pre-processing steps using spaCy's language model, we will end up with the following sentences:

S1: "dog sat mat."

S2: "cat love dog."

As Python lists, these will now look like:

S1: ['dog', 'sat', 'mat']

S2: ['cat', 'love', 'dog']

If we want to represent this as a vector, we would need to first construct our vocabulary, which would be the unique words found in the sentences. Our vocabulary vector is now:

Vocab = ['dog', 'sat', 'mat', 'love', 'cat']

This means that our representation of our sentences will also be vectors with a length of 5 - we can also say that our vectors will have 5 dimensions. We can also think of a mapping of each word in our vocabulary to a number (or index), in which case we can also refer to our vocabulary as a dictionary.

The bag-of-words model involves using word frequencies to construct our vectors. What will our sentences now look like?

S1: [1, 1, 1, 0, 0]

S2: [1, 0, 0, 1, 1]

It's easy enough to understand - there is 1 occurrence of dog, the first word in the vocabulary, and 0 occurrences of love in the first sentence, so the appropriate indexes are given the value based on the word frequency. If the first sentence has 2 occurrences of the word dog, it would be represented as:

S1: [2, 1, 1, 0, 0]

This is just an example of the idea behind a bag-of-words representation - the way Gensim approaches bag-of-words is slightly different, and we will see this in the coming section. One important feature of the bag-of-words model that we must remember is that it is an orderless document representation — only the counts of the words matter. We can see this in our example above as well: by looking at the resulting sentence vectors, we do not know which words came first. This leads to a loss of spatial information, and by extension, semantic information. However, in a lot of information retrieval algorithms, the order of the words is not important, and just the occurrences of the words are enough for us to start with.
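The counting logic above can be sketched in a few lines of plain Python (the function and variable names here are ours, for illustration - this is not how Gensim does it internally):

```python
# Toy bag-of-words: represent each sentence as a vector of word counts
# over a fixed vocabulary.
vocab = ['dog', 'sat', 'mat', 'love', 'cat']

def bow_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in the token list."""
    return [tokens.count(word) for word in vocab]

s1 = ['dog', 'sat', 'mat']
s2 = ['cat', 'love', 'dog']

print(bow_vector(s1, vocab))  # [1, 1, 1, 0, 0]
print(bow_vector(s2, vocab))  # [1, 0, 0, 1, 1]
```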

An example where the bag-of-words model can be used is spam filtering - emails that are marked as spam are likely to contain spam-related words, such as buy, money, stock, and so on. By converting the text in emails into a bag-of-words model, we can use Bayesian probability to determine whether a mail is more likely to end up in the spam folder or not. This works because, as we discussed before, in this case the order of the words is not important - just whether they exist in the mail or not.

TF-IDF

TF-IDF is short for term frequency-inverse document frequency. Largely used in search engines to find relevant documents based on a query, it is a rather intuitive approach to converting our sentences into vectors.

As the name suggests, TF-IDF tries to encode two different kinds of information: term frequency and inverse document frequency. Term frequency (TF) is the number of times a word appears in a document.

IDF helps us understand the importance of a word in a document. It is the logarithmically scaled inverse fraction of the documents that contain the word: divide the total number of documents by the number of documents containing the term, then take the logarithm of that quotient. This gives us a measure of how common or rare the word is across all documents.

In case the above explanation wasn't very clear, expressing them as formulas will help!

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)

TF-IDF is simply the product of these two factors - TF, and IDF. Together it encapsulates more information into the vector representation, instead of just using the count of the words like in the bag-of-words vector representation. TF-IDF makes rare words more prominent and ignores common words like is, of, that and so on, which may appear a lot of times but have little importance.

For more information on how TF-IDF works, especially with the mathematical nature of TF-IDF and solved examples, the Wikipedia page on TF-IDF is a good resource.
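As a quick sanity check, the two formulas can be written out in plain Python (a bare-bones sketch of the formulas above - it does not guard against terms that appear in no document):

```python
import math

def tf(term, doc):
    """Term frequency: occurrences of the term over the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log of (total docs / docs containing term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [['dog', 'sat', 'mat'], ['cat', 'love', 'dog']]
# 'dog' appears in every document, so its IDF (and hence TF-IDF) is 0
print(tf_idf('dog', docs[0], docs))   # 0.0
# 'mat' appears in only one document, so it scores higher
print(round(tf_idf('mat', docs[0], docs), 3))  # 0.231
```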

Other Representations

It's possible to extend these representations - indeed, topic models, which we will explore later, are one such example. Word vectors are also an interesting representation of words, where we train a shallow neural network (a neural network with 1 or 2 layers) to describe words as vectors, with each feature being a semantic encoding of the word. We will spend an entire chapter discussing word vectors, in particular word2vec. To get a taste of what word vectors do, the blog post The Amazing Power of Word Vectors is a good start.

Vector Transformations in Gensim

Now that we know what vector transformations are, let's get used to creating them, and using them. We will be performing these transformations with Gensim, but even scikit-learn can be used. We'll also have a look at scikit-learn's approach later on.

Let's create our corpus now. We discussed earlier that a corpus is a collection of documents. In our examples, each document will just be one sentence, but this is obviously not the case in most real-world examples we will be dealing with. We should also note that, once we are done with pre-processing, we get rid of all punctuation marks - as far as our vector representation is concerned, each document is just one sentence.

Of course, before we start, be sure to install Gensim. As with spaCy, pip or conda is the best way to do this, depending on your working environment.

from gensim import corpora

documents = [u"Football club Arsenal defeat local rivals this weekend.",
             u"Weekend football frenzy takes over London.",
             u"Bank open for takeover bids after losing millions.",
             u"London football clubs bid to move to Wembley stadium.",
             u"Arsenal bid 50 million pounds for striker Kane.",
             u"Financial troubles result in loss of millions for bank.",
             u"Western bank files for bankruptcy after financial losses.",
             u"London football club is taken over by oil millionaire from Russia.",
             u"Banking on finances not working for Russia."]

Just a note - we make sure that all the strings are Unicode strings so that we can use spaCy for pre-processing.

import spacy

nlp = spacy.load("en")

texts = []
for document in documents:
    text = []
    doc = nlp(document)
    for w in doc:
        # keep lemmas of words that are not stopwords, punctuation, or numbers
        if not w.is_stop and not w.is_punct and not w.like_num:
            text.append(w.lemma_)
    texts.append(text)

print(texts)

We performed very similar pre-processing when we introduced spaCy. What do our documents look like now?

[[u'football', u'club', u'arsenal', u'defeat', u'local', u'rival', u'weekend'],

 [u'weekend', u'football', u'frenzy', u'take', u'london'],

 [u'bank', u'open', u'bid', u'lose', u'million'],

 [u'london', u'football', u'club', u'bid', u'wembley', u'stadium'],

 [u'arsenal', u'bid', u'pound', u'striker', u'kane'],

 [u'financial', u'trouble', u'result', u'loss', u'million', u'bank'],

 [u'western', u'bank', u'file', u'bankruptcy', u'financial', u'loss'],

 [u'london', u'football', u'club', u'take', u'oil', u'millionaire', u'russia'],

 [u'bank', u'finance', u'work', u'russia']]

Let's start by whipping up a bag-of-words representation for our mini-corpus. Gensim allows us to do this very conveniently through its Dictionary class.

dictionary = corpora.Dictionary(texts)

print(dictionary.token2id)

{u'pound': 17, u'financial': 22, u'kane': 18, u'arsenal': 3, u'oil': 27, u'london': 7, u'result': 23, u'file': 25, u'open': 12, u'bankruptcy': 26, u'take': 9, u'stadium': 16, u'wembley': 15, u'local': 4, u'defeat': 5, u'football': 2, u'finance': 31, u'club': 0, u'bid': 10, u'million': 11, u'striker': 19, u'frenzy': 8, u'western': 24, u'trouble': 21, u'weekend': 6, u'bank': 13, u'loss': 20, u'rival': 1, u'work': 30, u'millionaire': 29, u'lose': 14, u'russia': 28}

There are 32 unique words in our corpus, all of which are represented in our dictionary, with each word being assigned an index value. When we refer to a word's word_id henceforth, we mean the word's integer-ID mapping made by the dictionary.

We will be using the doc2bow method, which, as the name suggests, helps convert our document to bag-of-words.

corpus = [dictionary.doc2bow(text) for text in texts]

If we print our corpus, we'll have our bag of words representation of the documents we used.

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],

 [(2, 1), (6, 1), (7, 1), (8, 1), (9, 1)],

 [(10, 1), (11, 1), (12, 1), (13, 1), (14, 1)],

 [(0, 1), (2, 1), (7, 1), (10, 1), (15, 1), (16, 1)],

 [(3, 1), (10, 1), (17, 1), (18, 1), (19, 1)],

 [(11, 1), (13, 1), (20, 1), (21, 1), (22, 1), (23, 1)],

 [(13, 1), (20, 1), (22, 1), (24, 1), (25, 1), (26, 1)],

 [(0, 1), (2, 1), (7, 1), (9, 1), (27, 1), (28, 1), (29, 1)],

 [(13, 1), (28, 1), (30, 1), (31, 1)]]

This is a list of lists, where each individual list represents a document's bag-of-words representation. A reminder: you might see different numbers in your list; this is because each time you create a dictionary, a different mapping can occur. Unlike the example we demonstrated, where the absence of a word was marked by a 0, here we use tuples of (word_id, word_count). We can easily verify this by checking the original sentences, mapping each word to its integer ID, and reconstructing our list. We can also notice that in this case each document has no more than one occurrence of each word - in smaller corpuses, this tends to happen.
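To make the tuple format concrete, here is a plain-Python sketch of roughly what doc2bow does, given a token2id mapping like the one printed earlier (the ids here are a hand-picked subset; Gensim's real method has more options, such as updating the dictionary on the fly):

```python
from collections import Counter

# A hand-picked subset of a token2id mapping like the one printed above
# (the exact ids vary from run to run).
token2id = {'club': 0, 'football': 2, 'arsenal': 3, 'bank': 13}

def to_bow(tokens, token2id):
    """Mimic Dictionary.doc2bow: return sorted (word_id, count) tuples,
    silently dropping tokens that are not in the dictionary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

print(to_bow(['football', 'club', 'club', 'unknown'], token2id))
# [(0, 2), (2, 1)] - 'unknown' is ignored, 'club' is counted twice
```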

And voila! Our corpus is assembled, and we are ready to work machine learning/information retrieval magic on them whenever we would like. But before we sink our teeth into it… let's spend some more time with some details regarding corpuses.

We previously mentioned how Gensim is powerful because it uses streaming corpuses. But in this case, the entire list is loaded into RAM. This is not a bother for us because it is a toy example, but in any real-world cases, this might cause problems. How do we get past this?

We can start by storing the corpus, once it is created, to disk. One way to do this is as follows:

corpora.MmCorpus.serialize('/tmp/example.mm', corpus)

By storing the corpus to disk and then later loading it from disk, we are being far more memory-efficient, because at most one vector resides in RAM at a time. The Gensim tutorial on corpora and vector spaces covers a little more than what we discussed so far and may be useful for some readers.

Converting a bag-of-words representation into TF-IDF, for example, is also made very easy with Gensim. We first choose the model/representation we want from Gensim's models module.

from gensim import models

tfidf = models.TfidfModel(corpus)

This means that tfidf now represents a TF-IDF table trained on our corpus. Note that in the case of TF-IDF, the training consists simply of going through the supplied corpus once and computing document frequencies of all its features. Training other models, such as latent semantic analysis or latent Dirichlet allocation, is much more involved and, consequently, takes much more time. We will explore those transformations in the chapters on topic modeling. It is also important to note that all such vector transformations require the same input feature space - which means the same dictionary (and, of course, vocabulary).

So, what does a TF-IDF representation of our corpus look like? All we have to do is:

for document in tfidf[corpus]:
    print(document)

And this gives us:

[(0, 0.24046829370585293), (1, 0.48093658741170586), (2, 0.17749938483254057), (3, 0.3292179861221232), (4, 0.48093658741170586), (5, 0.48093658741170586), (6, 0.3292179861221232)]

[(2, 0.24212967666975266), (6, 0.4490913847888623), (7, 0.32802654645398593), (8, 0.6560530929079719), (9, 0.4490913847888623)]

[(10, 0.29592528218102643), (11, 0.4051424990000138), (12, 0.5918505643620529), (13, 0.2184344336379748), (14, 0.5918505643620529)]

[(0, 0.29431054749542984), (2, 0.21724253258131512), (7, 0.29431054749542984), (10, 0.29431054749542984), (15, 0.5886210949908597), (16, 0.5886210949908597)]

[(3, 0.354982288765831), (10, 0.25928712547209604), (17, 0.5185742509441921), (18, 0.5185742509441921), (19, 0.5185742509441921)]

[(11, 0.3637247180792822), (13, 0.19610384738673725), (20, 0.3637247180792822), (21, 0.5313455887718271), (22, 0.3637247180792822), (23, 0.5313455887718271)]

[(13, 0.18286519950508276), (20, 0.3391702611796705), (22, 0.3391702611796705), (24, 0.4954753228542582), (25, 0.4954753228542582), (26, 0.4954753228542582)]

[(0, 0.2645025265769199), (2, 0.1952400253294319), (7, 0.2645025265769199), (9, 0.3621225392416359), (27, 0.5290050531538398), (28, 0.3621225392416359), (29, 0.5290050531538398)]

[(13, 0.22867660961662029), (28, 0.4241392327204109), (30, 0.6196018558242014), (31, 0.6196018558242014)]

If you remember what we said about TF-IDF, you will be able to identify the float next to each word_id - it is the product of the TF and IDF scores for that particular word, instead of just the word count which was present before. The higher the score, the more important the word in the document.

We can use this representation as input for our ML algorithms as well, and we can also further chain or link these vector representations by performing another transformation on them.

You’ve read an excerpt from Packt Publishing's latest book Natural Language Processing and Computational Linguistics, authored by Bhargav Srinivasa-Desikan.
