
Machine Learning for Smarter Search With Elasticsearch


Machine Learning is revolutionizing everything — even search. Elasticsearch's Learning to Rank plugin teaches Machine Learning models what users deem relevant.


It’s no secret that Machine Learning is revolutionizing many industries. This is equally true in search, where companies exhaust themselves capturing nuance through manually tuned search relevance. Mature search organizations want to get past the “good enough” of manual tuning to build smarter, self-learning search systems.

That’s why we’re excited to release the Elasticsearch Learning to Rank Plugin. What is Learning to Rank? With Learning to Rank, a team trains a Machine Learning model to learn what users deem relevant.

When implementing Learning to Rank, you need to:

  • Measure what users deem relevant through analytics to build a judgment list grading documents as exactly relevant, moderately relevant, or not relevant for queries.
  • Hypothesize which features might help predict relevance, such as the TF*IDF of specific field matches, recency, personalization for the searching user, etc.
  • Train a model that can accurately map features to a relevance score.
  • Deploy the model to your search infrastructure, using it to rank search results in production.

Don’t fool yourself. Underneath each of these steps lie complex technical and non-technical problems. There’s still no silver bullet. As we mention in Relevant Search, manual tuning of search results comes with many of the same challenges as a good learning to rank solution. We’ll have more to say about the many infrastructure, technical, and non-technical challenges of mature learning to rank solutions in future blog posts.

In this blog post, I want to tell you about our work to integrate learning to rank within Elasticsearch. Clients ask us in nearly every relevance consulting engagement whether or not this technology can help them. However, while there’s a clear path in Solr thanks to Bloomberg, there hasn’t been one in Elasticsearch. Many clients want the modern affordances of Elasticsearch, but find this a crucial missing piece to selecting the technology for their search stack.

Indeed, Elasticsearch’s Query DSL can rank results with tremendous power and sophistication. A skilled relevance engineer can use the query DSL to compute a broad variety of query-time features that might signal relevance, giving quantitative answers to questions like:

  1. How much is the search term mentioned in the title?
  2. How long ago was the article/movie/etc. published?
  3. How does the document relate to the user’s browsing behaviors?
  4. How expensive is this product relative to a buyer’s expectations?
  5. How conceptually related is the user’s search term to the subject of the article?

Many of these features aren’t static properties of the documents in the search engine. Instead, they are query-dependent, meaning that they measure some relationship between the user or their query and a document. Readers of Relevant Search will recognize these as what that book calls signals.

So, the question becomes, how can we marry the power of machine learning with the existing power of the Elasticsearch Query DSL? That’s exactly what our plugin does: use Elasticsearch Query DSL queries as feature inputs to a Machine Learning model.

How Does It Work?

The plugin integrates RankLib and Elasticsearch. Ranklib takes a file of judgments as input and outputs a model in its own native, human-readable format. It lets you train models either programmatically or via the command line. Once you have a model, the Elasticsearch plugin provides the following:

  • A custom Elasticsearch script language called ranklib that can accept Ranklib-generated models as Elasticsearch scripts.
  • A custom ltr query that takes a list of Query DSL queries (the features) and a model name (the name of the stored ranklib script) and scores results.

As learning to rank models can be expensive to evaluate, you almost never want to use the ltr query directly. Rather, you would rescore the top N results, such as:

 {
  "query": { /* a simple base query goes here */ },
  "rescore": {
   "window_size": 100,
   "query": {
    "rescore_query": {
     "ltr": {
      "model": {
       "stored": "dummy"
      },
      "features": [{
       "match": {
        "title": "<user's keyword search>"
       }
      }]
     }
    }
   }
  }
 }

You can dig into a fully functioning example in the scripts directory of the project. It’s a canned example, using hand-created judgments of movies from TMDB. I use an Elasticsearch index with TMDB to execute queries corresponding to features, augment a judgment file with the relevance scores of those queries and features, and train a Ranklib model at the command line. I store the model in Elasticsearch and provide a script to search using the model.

Don’t be fooled by the simplicity of this example. A real learning to rank solution is a tremendous amount of work, including studying users, processing analytics, data engineering, and feature engineering. I say that not to dissuade you, because the payoff can be worth it; just know what you’re getting into. Smaller organizations might still do better with the ROI of hand-tuned results.
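The rescore request shown earlier in this section can also be assembled in Python as a plain dictionary. This is a minimal sketch: the helper name build_ltr_rescore and its parameters are hypothetical and not part of the plugin, and the base query is just an assumed simple match on _all. Only the layout of the ltr query follows the request shown earlier.

```python
import json


def build_ltr_rescore(keywords, model_name, window_size=100):
    """Build a search body that rescores the top `window_size` hits with an ltr query.

    Hypothetical helper for illustration; not part of the plugin.
    """
    return {
        # A cheap base query selects the candidate documents (assumed here).
        "query": {"match": {"_all": keywords}},
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "ltr": {
                        "model": {"stored": model_name},
                        # Each feature is an ordinary Query DSL query.
                        "features": [{"match": {"title": keywords}}],
                    }
                }
            },
        },
    }


body = build_ltr_rescore("rambo", model_name="dummy")
print(json.dumps(body, indent=1))
```

Keeping the body as a function of the keywords makes it easy to keep the feature list in one place, in the same order the model was trained with.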

Training and Loading the Learning to Rank Model

Let’s start with the hand-created, minimal judgment list I’ve provided to show how our example trains a model.

Ranklib judgment lists come in a fairly standard format. The first column contains the judgment (0-4) for a document. The next column is a query id, such as “qid:1”. The subsequent columns contain the feature values for that query-document pair, each written as index:value: the number to the left of the colon is the 1-based index of the feature, and the number to the right is the value of that feature. The example in the Ranklib README is:

3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B
1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C
1 qid:1 1:0 2:0 3:1 4:0.3 5:0 # 1D
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2A

Notice also the comment (# 1A, etc.). That comment is the document identifier for this judgment. The document identifier isn’t needed by Ranklib, but it’s fairly handy for human readers. As we’ll see, it’s useful for us as well when we gather features via Elasticsearch queries.
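To make the format concrete, here is a small sketch of a parser for one judgment line. The function name parse_judgment_line and the dictionary layout it returns are hypothetical, chosen for illustration; they are not taken from the example project.

```python
def parse_judgment_line(line):
    """Parse one Ranklib-style judgment line into a dictionary.

    Hypothetical helper: the returned layout is an illustrative choice.
    """
    # Everything after the first '#' is the comment holding the document id.
    data, _, comment = line.partition("#")
    tokens = data.split()
    grade = int(tokens[0])
    qid = int(tokens[1].split(":")[1])  # "qid:1" -> 1
    features = {}
    for token in tokens[2:]:
        index, value = token.split(":")  # "4:0.2" -> feature 4, value 0.2
        features[int(index)] = float(value)
    return {
        "grade": grade,
        "qid": qid,
        "features": features,
        "doc_id": comment.strip() or None,
    }


print(parse_judgment_line("3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A"))
```

The same function also accepts lines that carry no feature columns yet, in which case the features dictionary is simply empty.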

Our example starts with a trimmed-down version of the above file (seen here) that simply has a grade, query id, and document id tuple on each line, like so:

4 qid:1 # 7555
3 qid:1 # 1370
3 qid:1 # 1369
3 qid:1 # 1368
0 qid:1 # 136278
...

As above, we provide the Elasticsearch _id for the graded document as the comment on each line.

We need to enhance this a bit further. We must map each query id (qid:1) to an actual keyword query (“Rambo”) so we can use the keyword to generate feature values. We provide this mapping in the header, which the example code will pull out:

# Add your keyword strings below, the feature script will
# Use them to populate your query templates
#
# qid:1: rambo
# qid:2: rocky
# qid:3: bullwinkle
#
# https://sourceforge.net/p/lemur/wiki/RankLib%20File%20Format/
#
#
4 qid:1 # 7555
3 qid:1 # 1370
3 qid:1 # 1369
3 qid:1 # 1368
0 qid:1 # 136278
...

To help clear up some confusion, I’m going to start talking about ranklib “queries” (the qid:1 etc) as “keywords” to differentiate from the Elasticsearch Query DSL “queries” which are Elasticsearch-specific constructs used to generate feature values.

What’s above isn’t a complete Ranklib judgment list. It’s just a minimal sample of relevance grades for given documents for a given keyword search. To be a fully-fledged training set, it needs to include feature values, the 1:0 2:1 … columns that follow the query id on each line of the first judgment list shown.
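As a rough illustration of how the keyword mapping could be pulled out of the header, the sketch below scans for comment lines of the form "# qid:N: keyword". The function name keywords_from_header is hypothetical; the example project has its own loading code, which may work differently.

```python
import re

# Matches header comments such as "# qid:1: rambo".
KEYWORD_RE = re.compile(r"^#\s*qid:(\d+):\s*(.+?)\s*$")


def keywords_from_header(lines):
    """Return {qid: keyword} from the commented header of a judgment file.

    Hypothetical helper for illustration only.
    """
    keywords = {}
    for line in lines:
        match = KEYWORD_RE.match(line)
        if match:
            keywords[int(match.group(1))] = match.group(2)
    return keywords


sample = """# Add your keyword strings below
# qid:1: rambo
# qid:2: rocky
# qid:3: bullwinkle
#
4 qid:1 # 7555
3 qid:1 # 1370
"""
print(keywords_from_header(sample.splitlines()))
```

Judgment lines never start with "#", so they are skipped by the pattern and only the header entries end up in the mapping.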

To generate those feature values, we also need to have proposed features that might correspond to relevance for movies. These, as we said, are Elasticsearch queries. The scores for these Elasticsearch queries will finish filling out the judgment list above. In the example, we do this using a jinja template corresponding to each feature number. For example, the file 1.json.jinja is the following Query DSL query:

 { "query": { "match": { "title": "" } } } 

In other words, we’ve decided that feature 1 for our movie search system ought to be the TF*IDF relevance score for the user’s keywords when matched against the title field. There’s also 2.json.jinja, which performs a more complex search across multiple text fields:

{
  "query": {
    "multi_match": {
      "query": "",
      "type": "cross_fields",
      "fields": ["overview", "genres.name", "title", "tagline", "belongs_to_collection.name", "cast.name", "directors.name"],
      "tie_breaker": 1.0
    }
  }
}
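The two feature queries above can be thought of as functions from a keyword string to a Query DSL body. The sketch below expresses them as plain Python functions instead of Jinja templates; this is a stand-in to show the idea, not how the example project does it, and the names feature_1, feature_2, and feature_queries are hypothetical.

```python
FIELDS = [
    "overview", "genres.name", "title", "tagline",
    "belongs_to_collection.name", "cast.name", "directors.name",
]


def feature_1(keyword):
    """Feature 1: relevance score of the keyword matched against the title."""
    return {"query": {"match": {"title": keyword}}}


def feature_2(keyword):
    """Feature 2: relevance score of a cross-field search over several text fields."""
    return {
        "query": {
            "multi_match": {
                "query": keyword,
                "type": "cross_fields",
                "fields": FIELDS,
                "tie_breaker": 1.0,
            }
        }
    }


# Feature numbers are 1-based, matching the columns of the judgment file.
FEATURES = {1: feature_1, 2: feature_2}


def feature_queries(keyword):
    """Return {feature number: Query DSL body} for one keyword."""
    return {number: build(keyword) for number, build in sorted(FEATURES.items())}


print(feature_queries("rambo")[1])
```

Adding a third feature would then amount to writing one more function and registering it under the next number.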

Part of the fun of learning to rank is hypothesizing what features might correlate with relevance. In the example, you can change features 1 and 2 to any Elasticsearch query. You can also experiment by adding additional features 3 through however many. Too many features cause problems, though: you’ll need enough representative training samples to cover all reasonable feature values. We’ll discuss more about training and testing learning to rank models in a future blog post.

With these two ingredients, the minimal judgment list and a set of proposed Query DSL queries/features, we need to generate a fully fleshed-out judgment list for Ranklib and load the Ranklib-generated model into Elasticsearch to be used. This means:

  1. Getting relevance scores for features for each keyword/document pair, that is, issuing queries to Elasticsearch to log relevance scores.
  2. Outputting a full judgment file with not only grades and keyword query ids but also the feature values from step 1.
  3. Running Ranklib to train the model.
  4. Loading the model into Elasticsearch for use at search time.

The code to do this is all bundled up in train.py, which I encourage you to take apart. To run this, you’ll need:

    • RankLib.jar downloaded to the scripts folder.
    • The Python packages elasticsearch and jinja2 installed (there’s a requirements.txt if you’re familiar with pip).

    Then you can just run:

    python train.py 

    This one script runs through all the steps mentioned above. To walk you through the code:

    First, we load the minimal judgment list, which holds just (grade, keyword query id, document id) tuples, with the search keywords specified in the file’s header:

    judgements = judgmentsByQid(judgmentsFromFile(filename='sample_judgements.txt')) 

    We then issue bulk Elasticsearch queries to log features for each judgment (augmenting the passed in judgments).

    kwDocFeatures(es, index='tmdb', searchType='movie', judgements=judgements) 

    The function kwDocFeatures finds 1.json.jinja through N.json.jinja (the features/queries), and strategically batches Elasticsearch queries up to get a relevance score for each keyword/document tuple using Elasticsearch’s bulk search (_msearch) API. The code is tedious; you can see it here.
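To give a feel for what such batching might look like, here is a sketch that builds an _msearch payload for one keyword: one search per feature query, each restricted to the judged documents with an ids filter. Restricting by ids is an assumption made for this illustration and is not necessarily what kwDocFeatures does; the function name msearch_payload is hypothetical as well.

```python
import json


def msearch_payload(index, feature_queries, doc_ids):
    """Build a newline-delimited _msearch body: one search per feature query.

    Hypothetical helper. Each search scores only the judged documents, so a
    judged document missing from a response can be treated as scoring 0.
    """
    lines = []
    for feature_query in feature_queries:
        # Header line: which index this search targets.
        lines.append(json.dumps({"index": index}))
        # Body line: the feature query, filtered down to the judged documents.
        lines.append(json.dumps({
            "size": len(doc_ids),
            "_source": False,
            "query": {
                "bool": {
                    "must": [feature_query],
                    "filter": [{"ids": {"values": doc_ids}}],
                }
            },
        }))
    # The multi search format requires a trailing newline.
    return "\n".join(lines) + "\n"


payload = msearch_payload(
    "tmdb",
    [{"match": {"title": "rambo"}}, {"match": {"overview": "rambo"}}],
    ["7555", "1370"],
)
print(payload)
```

The responses come back in the same order as the searches, so the position of each response identifies the feature number, and the _score of each hit becomes that feature's value for the document.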

    Once we have the fully fleshed-out features, we then output the full training set (judgments plus features) into a new file (sample_judgements_wfeatures.txt):

    buildFeaturesJudgmentsFile(judgements, filename='sample_judgements_wfeatures.txt') 

    The output will correspond to a fully fleshed-out Ranklib judgment list, like so:

    3 qid:1 1:9.476478 2:25.821222 # 1370
    3 qid:1 1:6.822593 2:23.463709 # 1369

    Where feature 1 is the TF*IDF score of “Rambo” searched on the title (1.json.jinja); feature 2 is the TF*IDF score of the more complex search (2.json.jinja).
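Writing one such line is mostly string formatting. The sketch below shows one way to do it; format_judgment_line is a hypothetical name, and the example project's own buildFeaturesJudgmentsFile may be organized differently.

```python
def format_judgment_line(grade, qid, feature_values, doc_id):
    """Format one full judgment line in the Ranklib file format.

    Hypothetical helper. `feature_values` is a list whose first element is
    the value of feature 1, the second the value of feature 2, and so on.
    """
    features = " ".join(
        "%d:%s" % (number, value)
        for number, value in enumerate(feature_values, start=1)
    )
    return "%d qid:%d %s # %s" % (grade, qid, features, doc_id)


print(format_judgment_line(3, 1, [9.476478, 25.821222], "1370"))
# prints: 3 qid:1 1:9.476478 2:25.821222 # 1370
```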

    Next, we train! This line runs the Ranklib jar via the command line, using this saved file as judgment data:

    trainModel(judgmentsWithFeaturesFile='sample_judgements_wfeatures.txt', modelOutput='model.txt') 

    As you can see below, this basically just runs java -jar on the Ranklib jar to train a LambdaMART model:

    import os

    def trainModel(judgmentsWithFeaturesFile, modelOutput):
        # Equivalent to: java -jar RankLib-2.6.jar -ranker 6 -train sample_judgements_wfeatures.txt -save model.txt
        cmd = "java -jar RankLib-2.6.jar -ranker 6 -train %s -save %s" % (judgmentsWithFeaturesFile, modelOutput)
        print("Running %s" % cmd)
        os.system(cmd)
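If you adapt this step, one possible variation is to pass the command as an argument list to subprocess.run, which avoids shell quoting problems with file names and raises an error when Ranklib exits with a failure. This is a sketch of an alternative, not the example project's code; the names ranklib_command and train_model are hypothetical, while the jar name and flags are the ones used above.

```python
import subprocess


def ranklib_command(train_file, model_file, jar="RankLib-2.6.jar", ranker=6):
    """Return the Ranklib training command as an argument list.

    Hypothetical helper; ranker 6 selects LambdaMART, as in the example above.
    """
    return [
        "java", "-jar", jar,
        "-ranker", str(ranker),
        "-train", train_file,
        "-save", model_file,
    ]


def train_model(train_file, model_file):
    """Run Ranklib and raise CalledProcessError if training fails."""
    cmd = ranklib_command(train_file, model_file)
    print("Running %s" % " ".join(cmd))
    subprocess.run(cmd, check=True)
```

Calling train_model('sample_judgements_wfeatures.txt', 'model.txt') would then run the same training command as trainModel, provided Java and the jar are available in the working directory.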

    We then store the model into Elasticsearch using simple Elasticsearch commands:

    saveModel(es, scriptName='test', modelFname='model.txt') 

    Here, saveModel (seen here) just reads the file contents and POSTs them to Elasticsearch to be stored as a ranklib script.

    Searching With the Learning to Rank Model

    Once you’re done training, you’re ready to issue a search! You can see an example in search.py; it’s pretty straightforward with a simple query inside. You can run python search.py rambo, which will search for “rambo” using the trained model, executing the following rescoring query:

    {
      "query": {
        "match": { "_all": "rambo" }
      },
      "rescore": {
        "window_size": 20,
        "query": {
          "rescore_query": {
            "ltr": {
              "model": { "stored": "test" },
              "features": [
                { "match": { "title": "rambo" } },
                {
                  "multi_match": {
                    "query": "rambo",
                    "type": "cross_fields",
                    "tie_breaker": 1.0,
                    "fields": ["overview", "genres.name", "title", "tagline", "belongs_to_collection.name", "cast.name", "directors.name"]
                  }
                }
              ]
            }
          }
        }
      }
    }

    Notice we’re only reranking the top 20 results here. We could use the ltr query directly; indeed, running the model directly works pretty well, although it takes a few hundred milliseconds to run over the whole collection. For a larger collection, that wouldn’t be feasible. In general, it’s best to rerank only the top N results due to the performance cost of learning to rank models.
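The idea of reranking only the top N can be illustrated outside Elasticsearch in a few lines. The sketch below is purely conceptual: rerank_top_n and the toy scores are made up for illustration, and unlike this sketch, Elasticsearch's rescorer by default combines the base score with the rescore query's score rather than replacing the order outright.

```python
def rerank_top_n(hits, model_score, window_size):
    """Reorder only the first `window_size` hits by the model's score.

    `hits` is a list of document ids in base-query order; `model_score` maps
    a document id to a score. Hits outside the window keep their base order
    and are never scored by the expensive model.
    """
    head, tail = hits[:window_size], hits[window_size:]
    reranked = sorted(head, key=model_score, reverse=True)
    return reranked + tail


# Toy data: base ranking plus made-up model scores.
base_order = ["a", "b", "c", "d", "e"]
scores = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 5.0, "e": 7.0}
print(rerank_top_n(base_order, scores.get, window_size=3))
# prints: ['b', 'c', 'a', 'd', 'e']
```

Documents "d" and "e" would have scored highest under the model, but they fall outside the window and stay where the base query put them. That is the trade-off of rescoring: the base query has to be good enough to get the right candidates into the window.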

    And that’s the working example. Of course, this is just a dumb, canned example meant to get your juices flowing. Your particular problem likely has many more moving parts. The features you choose, how you log features, train your model, and implement a baseline ranking function depend quite a bit on your domain. Much of what we write about in Relevant Search still applies. 

    What’s Next

    In future blog posts, we’ll have much more to say about learning to rank, including:

    • Basics: More about what learning to rank is exactly.
    • Applications: Using learning to rank for search, recommendation systems, personalization and beyond.
    • Models: What are the prevalent models? What considerations play in selecting a model?
    • Considerations: What technical and non-technical considerations come into play with Learning to Rank?

    If you think you’d like to discuss how your search application can benefit from learning to rank, please let us know. We’re also always on the hunt for collaborators or for more folks to beat up our work in real production systems. So, give it a go and send us feedback!


    Published at DZone with permission of Doug Turnbull, DZone MVB. See the original article here.
