Google has been doing some work over the last few years on representing words (and their meaning) as vectors. One of the foundational papers, "Efficient Estimation of Word Representations in Vector Space", was done at Google. Tomas Mikolov figures significantly in much of the work on these representations, and if you follow his Semantic Scholar link from the paper above you will find quite a few related papers (be prepared: they are deeply theoretical and highly mathematical). Most of this work led to a project that has gotten some press lately:
This project is a tool that computes continuous distributed representations of words (the word vectors). But if you're going to build your own vectors, be prepared for the required corpus size and compute requirements. After all, this is Google, and they bandy about terms like "hundreds of billions of words in the corpus" and "using many cores on many machines in your data center". They have, however, created some demos if you want to play with the vectors they built from their own corpus (which came from Wikipedia), and they have included some of those in the project.
The vector-learning portion of the system generates a dense vector for each "important" word in the corpus. In other words, tens of thousands of vectors, each with hundreds of dimensions, so you won't be plotting these on graph paper.
But the fascinating part of these resultant vectors is how you can manipulate them in subtle ways. One example they give is:
vRome = vParis - vFrance + vItaly
vQueen = vKing - vMan + vWoman
The process adds and/or subtracts the vectors on the right-hand side of the equation to produce a new vector, then finds the word whose vector is closest to that result. Simple and elegant.
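The add-subtract-then-find-nearest procedure can be sketched in a few lines of NumPy. The vectors below are tiny hand-made stand-ins (real word2vec vectors have hundreds of dimensions and come from training on a large corpus), but the arithmetic and the cosine-similarity lookup are the real thing:

```python
import numpy as np

# Toy 3-dimensional "word vectors" -- illustrative stand-ins,
# not actual word2vec output.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def analogy(a, b, c, vectors):
    """Compute v_a - v_b + v_c, then return the nearest word by cosine similarity."""
    target = vectors[a] - vectors[b] + vectors[c]
    best_word, best_sim = None, -2.0
    for word, vec in vectors.items():
        if word in (a, b, c):  # exclude the input words themselves
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

print(analogy("king", "man", "woman", vectors))  # → queen
```

Excluding the input words from the search is a standard trick: the raw result vector usually sits closest to one of its own inputs, so the interesting neighbor is the nearest *other* word.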
And, since each word is a vector it's possible to use a method such as K-means to cluster the words into related groups.
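To make the clustering idea concrete, here is a minimal K-means sketch over toy 2-D "word vectors" (again, stand-ins for brevity; real vectors would have hundreds of dimensions, but the algorithm is identical):

```python
import numpy as np

# Toy 2-D word vectors: three animal-like points and three vehicle-like points.
words = ["cat", "dog", "fish", "car", "truck", "bus"]
X = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.7, 0.1],   # animals
    [0.1, 0.9], [0.2, 0.8], [0.1, 0.7],   # vehicles
])

def kmeans(X, k, iters=20, seed=0):
    """Bare-bones K-means: returns a cluster label for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=2), axis=1)
        # Recompute each center as the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

labels = kmeans(X, k=2)
for word, label in zip(words, labels):
    print(word, label)
```

With well-separated groups like these, the two clusters recover the animal/vehicle split regardless of which points the random initialization picks.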
One of the limitations (I would say weaknesses) is that the neural-net training of the vectors does not produce a human-understandable account of the knowledge embedded in them. This limitation is part of any neural-net solution: it is a trained solution, not a reasoned one. And because of the breadth and depth of the networks required for suitable performance, this approach also demands a vast amount of data. (Again, before I get any hate mail: this is a factual observation, not a condemnation of neural nets.) As a result, there is no obvious way to bootstrap this kind of learning mechanism for a new domain that has few examples.
The project seems straightforward and easy to understand, so perhaps someday we will see an actual microservice API from Google? Something to go along with SyntaxNet?
Playing around with meaning seems to add up. I'm ready for some antics with semantics!