Solr & Elasticsearch -- Modeling Signals to Build Real Semantic Search
When it comes to Solr and Elasticsearch, there are many misconceptions. Learn why you have to establish relevance when using Elasticsearch and Solr.
Perhaps the biggest relevance mistake you can make is to take content, straight from its source, and plop it directly into Elasticsearch or Solr. Trying to search fields that directly reflect attributes from your data's source indexed with default settings won’t give you great results.
Why is this? Unmodified relevance scores from field searches likely don’t correspond to meaningful business or domain ranking criteria. You need to get at criteria like "does a searched-for actor star in this movie?", "is this restaurant nearby?", "is this product well reviewed?". Yet data from APIs, databases, and filesystems is often subdivided into attributes ill-equipped to answer these questions with default search settings.
You almost always have to manipulate your data to answer these questions with a search engine. Luckily, with open source search, you are in control! You can craft relevance scores to reflect the needed criteria numerically. You control what terms end up in fields to form the basis of a relevance score. You can deeply manipulate Lucene to control scoring and searching! When you understand this, you begin to express relevance in terms of quantifiable business criteria instead of surrendering to default scoring and field structure. You see your fields as built to be searched and scored to measure important relevance signals -- numbers that correspond meaningfully to valuable criteria such as “the likelihood the user is searching for this type of restaurant”, “the proximity of the user to the restaurant”, “how well rated the restaurant is”, or “how much the body text is about the search terms” -- or whatever your users deem relevant.
What do I mean? Why can’t a search engine just figure out how to rank results by these criteria automatically? Let’s take text fields as an example. Recall the default text relevance score in Solr or Elasticsearch is TF*IDF. The score favors fields that mention the search terms more often (TF, or term frequency), weighted by how rare those terms are in the corpus (IDF, or inverse document frequency). TF*IDF has proven a reasonable, general-purpose measure of relevance for searching text.
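To make the TF and IDF pieces concrete, here is a minimal Python sketch of a textbook TF*IDF score. It is an illustration under simplifying assumptions, not Lucene's actual formula: real Lucene scoring adds length normalization and other factors, and more recent Solr and Elasticsearch releases default to BM25, which builds on the same TF and IDF ideas. The whitespace tokenizer, the function names, and the tiny corpus are all made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive "analysis" (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def idf(term, corpus):
    # Inverse document frequency: the fewer documents contain the term,
    # the larger its weight.
    df = sum(1 for doc in corpus if term in tokenize(doc))
    if df == 0:
        return 0.0
    return math.log(len(corpus) / df) + 1.0


def tf_idf(query, doc, corpus):
    # Sum, over the query terms, of (term frequency in doc) * (IDF in corpus).
    counts = Counter(tokenize(doc))
    return sum(counts[term] * idf(term, corpus) for term in tokenize(query))


# Hypothetical three-document corpus.
corpus = [
    "socrates taught plato and plato taught aristotle",
    "plato wrote dialogues about socrates",
    "aristotle wrote about logic",
]

scores = [tf_idf("plato logic", doc, corpus) for doc in corpus]
for doc, score in zip(corpus, scores):
    print(f"{score:.3f}  {doc}")
```

The first document wins because it mentions "plato" twice (TF). The third beats the second because "logic" appears in only one document while "plato" appears in two, so a single "logic" match counts for more (IDF).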
Users have their OWN expectations that have nothing to do with how your database or source system organizes data! Failing to model signals they care about is the biggest failing of most search solutions.
However, the specifics of your use case may mean default TF*IDF is way off the mark for measuring the criteria you need. We saw this when we dug into title search in a previous article. Odd terms like “who” turned out to be strikingly rare in title fields. This greatly disrupted the results ranking: a search like “Who is Socrates” ranked results like “Who is Plato” much higher than simple, on-topic titles like “Socrates”. The rareness (IDF) of terms like “who” wasn’t helpful in creating a signal associated with “the article’s title describes topics in the user’s search terms”. Therefore, we had to eliminate many of these strikingly rare terms. This and a number of other measures turned title search from a rather dumb search into a smarter experience closely resembling perusing shelves in a book store to find a good book on “Socrates”.
In other words, we improved the quality of the information associated with the title field’s relevance score. We turned the relevance score into a signal that more precisely measures when “the article’s title describes topics in the user’s search terms”. These criteria are more meaningfully tied to what the user intends, and to how our business wants to construct ranking rules.
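The “Who is Socrates” problem can be reproduced in a few lines of Python. This is a toy sketch under stated assumptions, not the analysis chain from the earlier article: the five titles, the two-word stopword list, and the simplified TF*IDF formula (no length normalization) are all invented for illustration.

```python
import math

# Hypothetical stopword list, chosen only for this example.
STOPWORDS = frozenset({"who", "is"})


def analyze(text, stopwords=frozenset()):
    # Lowercase, split on whitespace, and drop any stopwords.
    return [t for t in text.lower().split() if t not in stopwords]


def rank(query, titles, stopwords=frozenset()):
    # Rank titles by a simplified TF*IDF sum over the query terms.
    docs = [analyze(title, stopwords) for title in titles]

    def idf(term):
        df = sum(1 for terms in docs if term in terms)
        if df == 0:
            return 0.0
        return math.log(len(docs) / df) + 1.0

    query_terms = analyze(query, stopwords)
    scored = [
        (sum(terms.count(q) * idf(q) for q in query_terms), title)
        for title, terms in zip(titles, docs)
    ]
    # Stable sort: ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]


titles = [
    "Socrates",
    "Socrates and Plato",
    "The Trial of Socrates",
    "Who is Plato",
    "Plato",
]

naive = rank("Who is Socrates", titles)
tuned = rank("Who is Socrates", titles, stopwords=STOPWORDS)
print("default analysis:  ", naive)
print("stopwords removed: ", tuned)
```

With default analysis, “who” and “is” each occur in only one title, so their high IDF pushes “Who is Plato” to the top. Once those terms are removed at both index and query time, only “socrates” contributes to the score and the on-topic titles lead. In this sketch the three Socrates titles tie; a real engine would typically separate them further with field-length normalization.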
The process of turning relevance scores into smarter, domain-specific signals that quantifiably measure the criteria important to you and your data is known as signal modeling. When signal modeling, we deeply manipulate fields so that their scores measure those criteria more precisely. It’s what differentiates high quality ranking solutions from those that don’t really try that hard. This is a key idea in Relevant Search and fundamental to how OSC approaches relevance problems.
When you think about all the ways you can manipulate fields in a search engine, you see how Lucene-based solutions can really enable signal modeling to capture important signals. The strength of open source search solutions is how deeply you can manipulate how terms make up a field and how they’re scored through features such as:
- Analysis to control the composition of terms in an index. Enabling domain-specific synonyms, stopwords, indexing of bigrams, or whatever tokens you can derive from your content!
- Disabling/enabling scoring features such as fieldNorms (tied to field length), term frequency, IDF, etc., to control various aspects of text-based relevance
- copyFields to copy finer-grained fields into a larger field to generate a more general relevance signal, when users don’t care to differentiate between arbitrarily subdivided fields like an article’s “abstract” and “body”
- Modifying the scoring or similarity through Lucene plugins to directly control relevance scoring
- Leveraging value fields such as user ratings, profitability, sales
- Using geographical information
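As a rough illustration of several items in this list, here is a sketch of an Elasticsearch index definition written as a Python dictionary. Treat it as an assumption-laden example, not a recommended configuration: every field, filter, and analyzer name is hypothetical, the stopword and synonym entries are placeholders, and the syntax follows the style of recent Elasticsearch versions (older releases and Solr's schema express the same ideas differently), so check it against the documentation for your version.

```python
import json

# Hypothetical index definition; all names below are invented for illustration.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Domain-specific stopwords (placeholder list).
                "domain_stopwords": {"type": "stop", "stopwords": ["who", "is"]},
                # Domain-specific synonyms (placeholder entry).
                "domain_synonyms": {"type": "synonym", "synonyms": ["tv, television"]},
                # Index bigrams alongside single terms.
                "bigrams": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                    "output_unigrams": True,
                },
            },
            "analyzer": {
                "title_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "domain_stopwords", "domain_synonyms", "bigrams"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # Custom analysis, with field-length norms switched off.
            "title": {"type": "text", "analyzer": "title_analyzer", "norms": False},
            # Index matches only, ignoring term frequency and positions.
            "tags": {"type": "text", "index_options": "docs"},
            # Finer-grained fields copied into one general-purpose field.
            "abstract": {"type": "text", "copy_to": "all_text"},
            "body": {"type": "text", "copy_to": "all_text"},
            "all_text": {"type": "text"},
            # Value and geographic fields usable as ranking signals.
            "rating": {"type": "float"},
            "location": {"type": "geo_point"},
        }
    },
}

# Print the JSON you would send when creating the index.
print(json.dumps(index_body, indent=2))
```

Each field here is shaped around a signal rather than around the source record: "all_text" exists only to be searched, and "title" trades default scoring behavior for a tighter measure of topical match.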
Indeed, once you realize that fields are NOT storage and retrieval mechanisms, but rather containers to enable scoring, you can begin to use them and the features above to build truly robust and precise relevance signals. You need not take fields from your source system at face value. Instead, you can manipulate them and their scoring to the nth degree to perfect the measurement they deliver at search time. Don’t fall into the trap of NOT mastering the tools at your disposal. Instead, understand that fields exist to be manipulated -- truly modeled -- to create a relevance signal that satisfies ranking criteria important to your users. As their precision increases, these signals take on meaning -- and enable real “semantic” search that expresses ranking in terms both the user and the business can associate with meaningful information.
This may seem abstract. Examples of signal modeling can be found throughout our blog as we reflect on how fields are composed and scored. It’s a skill that permeates our relevance philosophy -- to the point that we’re writing a book that features it heavily.
If you’d like to get beyond your own basic search solution and build something more robust, please contact us! We’d love to help you with your search problems!
Reminder! Pick up the book associated with these ideas! Buy Relevant Search -- enter code turnbullmu to get 38% off!