When we hear about machines getting “smarter,” learning to “understand” human language, it can seem like cause for concern. Will they understand everything we say? Will they be smarter than us? Will we lose control of them? We might feel we need to balance our desire for better apps (e.g., a better Siri) against concerns like these. In fact, no such trade-off is needed. When machines are “being smart” they are just performing fancy math on large amounts of data to answer very specific questions.
You may think that understanding analogies is something very human, but it has a mathematical analog (pun intended!) which can be employed to great effect, without there being any cause for alarm. I’m talking about a technique called word embeddings.
Word embeddings, also known as word vectors (or distributed word representations), are a relatively new technique in machine learning that is advancing how machines process natural languages, such as English. Researchers at Google released their version of this technique, called word2vec, in 2013. Since then, there’s been an explosion of interest in the idea. The mathematics behind the technique makes it possible to meaningfully answer questions like “Who is the Miles Davis of country music?” or “What’s the Pappy Van Winkle of rum?” But the technique’s usefulness goes far beyond analogies: this general idea is even at the heart of the latest great advances in machine translation.
How does it work? Word embeddings represent the words of a language in geometric space, i.e., as numeric vectors, so that the mathematical relationships between the vectors capture semantic relationships between the words. Think of a vector as simply a list of numbers: instead of one number for each word, you’ve got a list of maybe 100 numbers. Loosely speaking, each place on the list represents one aspect of the meaning of the word, e.g., whether it’s a noun, or the extent to which it represents femaleness. (In practice the learned dimensions are rarely this tidy, and a single aspect of meaning is usually spread across many of them.) To use a classic example, a good set of representations would capture the relationship “king is to man as queen is to woman” by ensuring that a particular mathematical relationship holds between the respective vectors (specifically, king - man + woman ≈ queen, meaning the result of the arithmetic lands closer to the vector for “queen” than to that of any other word).
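To make the arithmetic concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are hand-made for illustration (real embeddings are learned from text and have far more dimensions), and the names TOY_VECTORS, cosine, and analogy are hypothetical helpers, not part of word2vec.

```python
import math

# Hand-made toy vectors (assumed values, NOT learned embeddings).
# Informally, the three positions stand for: royalty, maleness, femaleness.
TOY_VECTORS = {
    "king":   [0.9, 0.9, 0.1],
    "queen":  [0.9, 0.1, 0.9],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.7, 0.8, 0.1],
    "apple":  [0.1, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Answer "a is to b as c is to ?" via the vector b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three input words, then pick the closest remaining word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# "man is to king as woman is to ?"  i.e. king - man + woman
print(analogy("man", "king", "woman", TOY_VECTORS))
```

With these toy values the nearest remaining word is “queen.” Note that the answer is chosen by closeness, not exact equality, which is why the input words themselves are excluded from the search.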
The beauty of word embeddings is that they capture such meaning through relationships. Think of the different relationships the word “queen” has to the words “monarch,” “princess,” “woman,” “drag,” “king,” “bee.” With sufficient examples of usage, all of these relationships are captured in the embeddings. This is nothing short of a breakthrough in machine learning!
Google’s word2vec project uses neural networks to learn vector representations of words from massive amounts of example text, e.g., millions of Google News articles. In addition to word2vec, there’s now also doc2vec, for representing entire documents as vectors, tweet2vec, and even emoji2vec.
One of the remarkable things about this method is that mathematical relationships such as the one illustrated above were not programmed in. They simply emerged in the word2vec representations as the system consumed more and more examples of how words are actually used.
Analogies involving concrete objects like kings and queens aren’t the only ones learned by these representations. They can also tell you that “ran” is to “run” as “spoke” is to “speak,” or that “short” is to “shorter” as “far” is to “farther.” No rules are needed here (with their unending lists of exceptions for irregular verbs and the like), just directions in vector space, e.g., a direction for tense to get from the present tense to the past tense, and a direction for degree to get from an adjective to its comparative. Sure, having a rule about appending -er to an adjective is useful in understanding made-up words like “awesomer,” but it’s utterly useless for the ordinary word “good” (“gooder”?).
Word embeddings can be deployed very effectively in a wide range of tasks. Content recommendations are one example. If a visitor to your website is viewing an item with the description “great for chopping vegetables,” and you want to recommend similar products, your system should identify an item with the description “ideal for slicing carrots” as similar, even though the words are different. In vector space, the words “chopping” and “slicing” are close, as are “vegetables” and “carrots.” This ability to identify texts as similar, even though the words are not the same, is also useful for identifying duplicate posts, e.g., on a forum, or simply for information retrieval, where you want to match someone’s query, “show me things that are great for chopping vegetables,” with all the relevant documents or items.
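One simple way to compare two descriptions, sketched below, is to average the vectors of the words in each and measure the cosine similarity of the two averages. This is only one of several possible approaches and is an assumption for illustration, not a method prescribed by word2vec. The four-dimensional vectors are made up by hand, and text_vector and similarity are hypothetical helper names.

```python
import math

# Hand-made toy vectors (assumed values, NOT learned embeddings).
# Informally: praise, cutting, food, outdoors.
TOY_VECTORS = {
    "great":      [0.9, 0.0, 0.0, 0.1],
    "ideal":      [0.8, 0.1, 0.0, 0.1],
    "chopping":   [0.0, 0.9, 0.2, 0.0],
    "slicing":    [0.0, 0.8, 0.3, 0.0],
    "vegetables": [0.0, 0.1, 0.9, 0.0],
    "carrots":    [0.0, 0.1, 0.8, 0.1],
    "sturdy":     [0.6, 0.0, 0.0, 0.4],
    "hiking":     [0.0, 0.0, 0.0, 0.9],
    "boots":      [0.0, 0.0, 0.0, 0.8],
}


def text_vector(text, vectors):
    """Average the vectors of the known words; unknown words are skipped."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        raise ValueError("no known words in text: %r" % text)
    return [sum(column) / len(known) for column in zip(*known)]


def similarity(text_a, text_b, vectors):
    """Cosine similarity between the averaged vectors of two texts."""
    a = text_vector(text_a, vectors)
    b = text_vector(text_b, vectors)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


viewed = "great for chopping vegetables"
for other in ("ideal for slicing carrots", "sturdy hiking boots"):
    print(other, round(similarity(viewed, other, TOY_VECTORS), 2))
```

With these toy values the knife-like description scores far higher than the hiking boots, even though it shares no content words with the item being viewed (“for” is simply skipped because it has no vector here).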
Word embeddings are one of the most cutting-edge techniques in Artificial Intelligence, but the whole thing is simply about applying smart mathematics to lots and lots of data. Don’t worry: the machines aren’t actually understanding anything! They’re just connecting hidden dots in our language usage. And the really good news for those who don’t have enormous amounts of data at their disposal is that we can make use of pre-trained embeddings in our own downstream machine learning tasks, such as text classification, even when we’re working with a relatively small number of documents. Google’s word2vec embeddings can be found here, and a set of embeddings trained by a group at Stanford on all of 2014 Wikipedia is available here.
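As a rough sketch of what using pre-trained embeddings can look like, the snippet below reads vectors from plain text, assuming one word per line followed by its space-separated vector components. That layout is an assumption here (check the format of whichever download you use), and load_embeddings and the sample data are hypothetical, with an in-memory string standing in for a real file.

```python
import io


def load_embeddings(lines):
    """Parse lines of the form 'word v1 v2 ... vn' into a dict of lists."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        embeddings[word] = [float(v) for v in values]
    return embeddings


# Stand-in for an opened file of pre-trained vectors (made-up numbers).
SAMPLE = io.StringIO(
    "chopping 0.1 0.9 0.2\n"
    "slicing 0.1 0.8 0.3\n"
)

vectors = load_embeddings(SAMPLE)
print(len(vectors), "words,", len(vectors["chopping"]), "dimensions each")
```

With a real file you would pass an opened file object instead of SAMPLE; the resulting dictionary can then feed a downstream task such as text classification.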