Semantic Search with Solr and NumPy

Built upon Lucene, Solr provides fast, highly scalable, and easily maintainable full-text search capabilities. However, under the hood, Solr is really just a sophisticated token-matching engine. What’s missing? Semantic Search!

Consider three somewhat silly documents:

  1. Yellow banana peels.
  2. A banana is a long yellow fruit.
  3. This mystery fruit is long and yellow and has a peel.

Now what happens if you search for the term “banana”? Under normal circumstances you only get back the first and second documents. But why shouldn’t you also get back the third document? It’s obviously talking about bananas!
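To make the token-matching point concrete, here is a minimal sketch of an inverted index over the three documents. It is a toy stand-in rather than a description of how Lucene works internally: the `tokenize` and `search` helpers are hypothetical names, and the tokenizer only lowercases and splits on non-letters, with no stemming.

```python
import re
from collections import defaultdict

docs = {
    1: "Yellow banana peels.",
    2: "A banana is a long yellow fruit.",
    3: "This mystery fruit is long and yellow and has a peel.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())

# Inverted index: token -> set of ids of the documents that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in tokenize(text):
        index[token].add(doc_id)

def search(term):
    """Return the sorted ids of the documents containing the exact token."""
    return sorted(index.get(term.lower(), set()))

print(search("banana"))  # [1, 2]
print(search("yellow"))  # [1, 2, 3]
```

Document 3 never contains the token `banana`, so pure token matching cannot return it, no matter how banana-like its content is.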

Semantic Search via Collaborative Filtering

My colleague Doug Turnbull and I recently set out to right this wrong with help from a machine learning technique called collaborative filtering. Collaborative filtering is most often used as a basis for recommendation algorithms. For example, collaborative filtering algorithms were the central focus of the now-famous Netflix Prize, which awarded $1 million to the team that could build the best movie recommendation engine. When dealing with recommendations, collaborative filtering works by mathematically identifying commonalities in groups of users based upon the movies that they enjoyed. Then, if you appear to fall in one of those groups, the recommendation engine will point you towards a movie that a) you haven’t watched and b) you are likely to enjoy.

So what does this have to do with Semantic Search? Everything! In just the same way that certain users gravitate towards certain movies, certain words commonly co-occur in the same documents. When working with Semantic Search, rather than recommending movies that users would likely enjoy, we are going to identify words that are likely to belong in a given document, whether or not they actually occurred there. The math is exactly the same!

Here’s how the process works:

  • First we identify a text field of interest in our documents and extract the associated term-document matrix for external processing. Each element of this term-document matrix indicates the strength of a particular term within a particular document (where strength can be anything, but will likely be either term frequency or TF*IDF).
  • Next, collaborative filtering is applied to the term-document matrix, which effectively generates a pseudo-term-document matrix. This pseudo-term-document matrix is the same size and shape as the original term-document matrix and references the same terms and documents, but the numbers are slightly different. These new values indicate the strength that a particular term should have in a particular document once noisy data is removed.
  • Finally, the high-scoring values in the pseudo-term-document matrix are mapped back to the associated terms. These terms are then injected back into Solr in a new field that can be used for Semantic Search.
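The three steps above can be sketched with NumPy on the banana documents. Treat this as an illustration under stated assumptions, not as the code in the repo: it uses a truncated SVD as the low-rank smoothing step, which is one common way to do this kind of collaborative filtering but is not confirmed here as the exact method. The hand-built matrix (stemmed, with stop words and "mystery" left out), the `blur` helper, the rank of 1, and the 0.3 cutoff are all made up for the example.

```python
import numpy as np

terms = ["banana", "yellow", "long", "fruit", "peel"]

# Step 1: term-document matrix. Rows are terms, columns are documents,
# values are term frequencies for the three banana documents.
tf = np.array([
    [1, 1, 0],  # banana
    [1, 1, 1],  # yellow
    [0, 1, 1],  # long
    [0, 1, 1],  # fruit
    [1, 0, 1],  # peel
], dtype=float)

def blur(matrix, rank):
    """Return the best rank-`rank` approximation of `matrix` via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]

# Step 2: pseudo-term-document matrix, same shape as the original.
pseudo = blur(tf, rank=1)

# Step 3: map the high-scoring cells back to terms for each document.
cutoff = 0.3  # assumed threshold, chosen only for this toy data
for doc in range(tf.shape[1]):
    pseudo_terms = [t for t, score in zip(terms, pseudo[:, doc]) if score >= cutoff]
    print(doc + 1, pseudo_terms)
```

With these toy numbers, the third document picks up a pseudo-score above the cutoff for `banana` even though the word never appears in it, which is exactly the effect we are after. On a real corpus the rank and cutoff would need tuning.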

Demo Time!

So let’s consider an example case. As in plenty of our previous posts, we will be using the Science Fiction Stack Exchange. Why? Because we’re all nerds and with such a familiar topic, we can quickly intuit whether or not a search is returning relevant results. In this data set, the field of interest is the Body field because it contains the contents of all questions and answers.

So, now that we’ve decided upon our demo dataset, we’re ready to run the analysis. If you’d like to follow along, then please take a look at our git repo. This repo contains the example SciFi data set, the Semantic Search code, and a README to get you going. However, I’m going to execute everything from within Python:

>>> from SemanticAnalyzer import *
>>> stvc = SolrTermVectorCollector(field='Body',feature='tf',batchSize=1000)
>>> tdc = TermDocCollection(source=stvc,numTopics=150) 

That last line takes a few minutes. If it’s in the AM where you are, grab a coffee. If it’s in the PM, grab a beer. Once that line completes, we will have successfully extracted the term-document matrix from Solr. Now let’s play with it for a bit. One of the cool side effects of this analysis is the ability to quickly find words that commonly occur together. Let’s give it an easy test; here are the 30 words most highly correlated with the word ‘vader’ (as in Darth Vader).

>>> tdc.getRelatedTerms('vader',30)

Did you notice that pause when you called the function? That was the collaborative filtering taking place. The results of that process have now been saved, so additional calls will return quite quickly.

vader luke emperor darth palpatin anakin sith skywalk sidiou apprentic empir luca side star son forc turn kill death rule suit father question jedi command obi tarkin dark wan plan

Hey, not bad! Everything here seems very reasonably connected with Mr. Vader. You may notice some odd spellings here; that’s because these are the indexed terms, which have been stemmed. Let’s try again with a different term; this time everyone’s favorite wizard:

>>> tdc.getRelatedTerms('potter',30)

harri potter voldemort wizard snape death magic jame love spell time rowl lili eater travel seri hous hand hogwart three find wormtail kill slytherin hallow secret deathli muggl order lord

Again, pretty good! One last try, and we’ll make it a little more challenging – a vague adjective:

>>> tdc.getRelatedTerms('dark',30)

dark side jedi sith eater lord death mark snape magic curs evil forc luke mercuri cave yoda jame palpatin dagobah anakin black call wizard slytherin live light siriu matter voldemort

Indeed, most of these terms are like a hall of fame of dark things from Star Wars and Harry Potter.
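For intuition about what a related-terms lookup like this might be doing, here is a small sketch that ranks terms by cosine similarity between their rows in a term-document matrix. Cosine similarity is an assumption on my part for the sake of illustration; nothing above pins down the exact measure that `getRelatedTerms` uses. The `term_vectors` data and the `related_terms` helper are hypothetical.

```python
import math

# Hypothetical term vectors: one row per term, one column per document.
term_vectors = {
    "vader":  [3.0, 2.0, 0.0, 0.0],
    "luke":   [2.0, 3.0, 0.0, 1.0],
    "sith":   [2.0, 1.0, 0.0, 0.0],
    "potter": [0.0, 0.0, 4.0, 2.0],
    "snape":  [0.0, 0.0, 2.0, 3.0],
}

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def related_terms(term, count):
    """Rank all terms by cosine similarity to `term`, most similar first."""
    target = term_vectors[term]
    ranked = sorted(
        term_vectors,
        key=lambda other: cosine(target, term_vectors[other]),
        reverse=True,
    )
    return ranked[:count]

print(related_terms("vader", 3))   # ['vader', 'sith', 'luke']
print(related_terms("potter", 2))  # ['potter', 'snape']
```

As in the real output above, the query term itself comes back first because nothing is more similar to a term than the term itself.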

Now since the word correlation has proven itself out, it’s time to generate the pseudo terms and post them back to Solr.

>>> SolrBlurredTermUpdater(tdc,blurredField="BodyBlurred").pushToSolr(0.1)

This line will probably see you to the end of your coffee or beer (it takes about 10 minutes on my machine). But once it’s done, you can start issuing searches to Solr.
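If you are curious what posting pseudo terms back might look like on the wire, here is a hedged sketch that builds a Solr atomic-update payload using the `set` modifier. Several things here are assumptions rather than facts about our code: that the `0.1` argument acts as a score cutoff, that the unique key field is named `id`, and the shape of the `pseudo_scores` input. The `build_updates` helper is hypothetical, and the sketch stops short of actually sending anything to Solr.

```python
import json

def build_updates(pseudo_scores, cutoff, blurred_field="BodyBlurred"):
    """Build Solr atomic-update documents that set the blurred field.

    pseudo_scores maps a document id to a {term: score} dict (assumed shape).
    Terms scoring below `cutoff` are dropped.
    """
    updates = []
    for doc_id, scores in pseudo_scores.items():
        kept = sorted(term for term, score in scores.items() if score >= cutoff)
        updates.append({"id": doc_id, blurred_field: {"set": " ".join(kept)}})
    return updates

# Made-up scores for a single document.
pseudo_scores = {
    "42": {"dark": 0.31, "side": 0.27, "episod": 0.12, "kirk": 0.02},
}
payload = json.dumps(build_updates(pseudo_scores, cutoff=0.1))
print(payload)
```

A payload like this would be POSTed as JSON to the core's update handler. Note that atomic updates come with their own requirements on the Solr version and schema, so check the Solr documentation before relying on them.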

Solr Results

Here’s an example of Semantic Search using Solr:

http://localhost:8983/solr/select/?q=-Body:dark +BodyBlurred:dark
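One practical wrinkle: in a raw URL, a literal `+` is decoded as a space, so the `+` operator in front of `BodyBlurred` can get lost unless it is percent-encoded. A minimal way to build the same request from Python is sketched below. The `fl` and `wt` parameters are standard Solr parameters that I have added for convenience; they are not part of the query shown above.

```python
from urllib.parse import urlencode

params = {
    "q": "-Body:dark +BodyBlurred:dark",
    "fl": "Body,BodyBlurred",  # only return the two fields we want to compare
    "wt": "json",              # ask for a JSON response
}

# urlencode percent-encodes the "+" as %2B and turns the space into "+".
url = "http://localhost:8983/solr/select/?" + urlencode(params)
print(url)
```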

The Body field contains the original text, while the BodyBlurred field contains the pseudo-terms. So this query finds all documents that do not include the term dark, but presumably contain dark content. Take a look at the documents that come back:

Body: " In the John Carter movie (2012), he shows off some of his powers, like 
jumping abnormally high, but I have difficulty evaluating his strength. On the one 
side, he shows great strength, as when he kills a thark warrior with one hand, but 
he is also quite mistreated by them. He also seems helpless when he is strangled 
by Tars Tarkas. Why does the strength he shows seem so inconsistent? ",
BodyBlurred: "tv great movi control kill consid hand dark side power long mutant 
fight machin light abil sauron wormtail hulk"
Body: " In the movies, the Nazgul ride black horses with armour. I was wondering 
if that is all they are, or do they have some sort of magic? Are they evil? ",
BodyBlurred: "movi black magic dark demon engin hous aveng slytherin"
Body: " The remaining Black Brother from the prologue of A Game of Thrones is 
apparently the deserter who is beheaded in the beginning of the book. But how did 
he manage to get to Winterfell from the other side of The Wall? Or did the show 
throw me off track and in the book there weren't any survivors, so the deserter is 
someone else? ",
BodyBlurred: "book watch black hole dark side plai long game demon engin light 
turn district"
Body: " Was this ever discussed in any episode, or as a side-plot somewhere? ",
BodyBlurred: "episod dark side light"

Not bad – most of those topics are rather … dark. Though check out that last result. So … maybe there are still some improvements we can make! But you also have to remember that we’re dealing with word correlation here, and I can only guess that somewhere else in the corpus, dark side-plots and dark episodes were surely discussed.

Speaking of word correlations, check out this gem:

Body: " You're correct, Enterprise is the only Star Trek that fits into both the 
original and the new 2009 movie timelines. From the perspective of the Enterprise 
characters, both are possible futures, given the over-arcing conceit of the show 
was a Temporal Cold War, so its future is in flux and could line up with either of 
the timelines we're familiar with, or with an entirely different future. ",
BodyBlurred: "answer charact place klingon star trek design travel crew watch work 
movi happen enterpris featur futur exist origin 2009 chang altern timelin war to 
version event captain gener pictur tng creat iii galaxi theori return alter voyag 
entir fry turn kirk paradox biff doc marti feder 1955 starship 2015 class hero 
centuri tempor uss phoenix mirror river 800 ncc 1701 simon conner skynet alisha"

The original document involves Star Trek and time travel. And appropriately, the pseudo terms include Star Trek things and time-travel terms … but do you see anything funny? That’s right, Biff, Doc and Marti made their way into the pseudo terms, likely because of their role in the popular time-travel film “Back to the Future.”

Speaking of the future …

Future Work

Semantic Search with Solr is hot right now. For the upcoming Dublin LuceneRevolution, I know of at least three related talks that have been submitted (one of them my own); I have heard that MapR is working on a Solr Semantic Search/Recommendation engine built atop their Hadoop offering; and I suspect that with Cloudera’s recent foray into Solr with Mark Miller, they will also be working on the same thing.

What’s next for our work? Recommendations! Remember, that’s how we started this conversation. E-commerce recommendations are a simple extension of the work presented above. Given an inventory catalog (e.g., product title, description, etc.), and given a history of user purchases, we can build a search-aware recommendation engine. That is, when a customer searches for a particular item, they will receive results as usual, except that the results will be boosted with items that they are more likely to purchase. How? Because we know what type of customer they are and what products that type of customer is more likely to buy!

Do you have a good case for Solr Semantic Search and Recommendation? We’d love to hear it. Please contact us!

Published at DZone with permission of