Payloads Are Neat, but Where’s a Complete Example for Solr?

Whenever I discuss payloads in Solr, I’ve been a bit frustrated that I couldn’t find an example with all the pieces in a single place. So I decided to create one for Solr 4.0+ (actually 4.8.1 at the time of this writing, but this should apply to the whole 4.x code line). There are many helpful fragments out there; our own Grant Ingersoll showed how to use payloads at the Lucene level in 2009.

Since then, payloads have been added to Solr…kinda. There is the DelimitedPayloadTokenFilterFactory that you can use when constructing an analysis chain in schema.xml; it takes delimited payload tokens and stores the payload along with the term. This fieldType and field are even in the standard distribution.

The question, though, is how do you use payloads while querying in Solr? This post provides an end-to-end example.

First a brief review

Payloads are a way to associate a numeric value with a term. So whenever a term in the query matches one in the document, you also have the numeric value available to use in scoring. There is a wide variety of uses for payloads. Here are a few:
  1. Parts of speech. Let’s say you want to weigh nouns more heavily than adjectives. Leaving aside the problem of recognizing nouns and adjectives… let’s just say you can. You can associate a weight that you can then incorporate in the scoring, using a greater weight for nouns that match terms in the search.
  2. Heuristically discovered correlations, aka “secret sauce”. You’ve analyzed usage patterns and discover that if the initial search phrase contains the word “fishing”, there’s a high likelihood that the user will buy a lure rather than a depth finder. At ingest time, whenever you find the word “fishing” in the description of something you’ve categorized as a “lure”, you add some weight to that term.
  3. Whenever you can correlate any behavior to certain terms, you can weigh these terms more heavily in the score calculations of Solr documents. This kind of processing can be very computationally intensive and thus too slow to do at search time. If you can offload that processing to index time by adding payloads, you can use the results of computing these correlations and still have fast searches.

Outline of the steps

Remember, I’m leaving aside how you make the correlations here.
  1. Add the payload to the term in the document.
  2. Change your schema.xml file to allow you to make use of that payload.
  3. Change your solrconfig.xml to recognize the new query parser you’re going to write.
  4. Write a new similarity class and query parser for the payloaded field.
  5. Use the new query parser in queries.
None of these steps is all that hard, but getting all the parts connected without guidance can be a pain. Here’s the cookbook.

Add the payloaded term to the document.

This is actually the easiest part: just use a pipe delimiter. Your term then looks like “fishing|5.0”.
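As a rough sketch of the ingest side, the plain Java below (no Solr dependencies) turns whitespace-separated text into a payloaded field value by appending a pipe and a weight to selected terms. The class name PayloadText, the method withPayloads, and the weights map are hypothetical names used only for illustration; how you arrive at the weights is the part this post leaves aside.

```java
import java.util.Map;
import java.util.StringJoiner;

final class PayloadText {
  // Appends "|weight" to every token that has an entry in the weights map.
  // Tokens without an entry pass through unchanged.
  static String withPayloads(String text, Map<String, Float> weights) {
    StringJoiner out = new StringJoiner(" ");
    for (String token : text.trim().split("\\s+")) {
      Float weight = weights.get(token);
      out.add(weight == null ? token : token + "|" + weight);
    }
    return out.toString();
  }
}
```

The resulting string is what you would send as the value of the payloaded field. The lookup here is an exact, case-sensitive match, which is an assumption of this sketch.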

Change your schema.xml file.

Your schema.xml file will have two changes. The default schema comes with a “payloads” field type. This will work fine for _ingesting_ the data, but we need to make one change to use this in scoring: add a new custom similarity to the field type. Your <fieldType> will look like this:

<fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField" >
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
  </analyzer>
  <similarity class="payloadexample.PayloadSimilarityFactory" />
</fieldtype>

The similarity class will be some custom code you’ll see later. Its job is to take the payload and use it to influence the score for the document. A feature of Solr 4.x is that you can define custom similarities on individual fields, which is what we’ve done here.

There’s one other change you need to make to the schema. Way down at the bottom you might see a comment about custom similarities. Add this line:

<similarity class="solr.SchemaSimilarityFactory"/>

This is what will allow your similarity in the <fieldType> to be found.

Change your solrconfig.xml file.

So far, so good. But how do you actually use this? It turns out that if you do not create your own parser, the default payload scoring is just to return 1.0f. So you need to create a parser that will actually use the value. There are two changes you’ll need: define the lib path for your jar and define a new query parser. This looks like:

<lib dir="path_to_jar_file_containing_custom_code" regex=".*\.jar" />

and then:  

<queryParser name="myqp" class="payloadexample.PayloadQParserPlugin" />

Write a new similarity class and query parser for the payloaded field.

OK, here’s the code. This is the longest part of the post, so bear with me. Or skip to the end and copy/paste this later. There are two files that I put in the same jar in my example code. First the PayloadQParserPlugin (see the changes to solrconfig.xml).

package payloadexample;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.parser.QueryParser;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.QueryParsing;
import org.apache.solr.search.SyntaxError;

// Just the factory class that doesn't do very much in this
// case but is necessary for registration in solrconfig.xml.
public class PayloadQParserPlugin extends QParserPlugin {

  @Override
  public void init(NamedList args) {
    // Might want to do something here if you want to preserve
    // information for subsequent calls!
  }

  @Override
  public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
    return new PayloadQParser(qstr, localParams, params, req);
  }
}

// The actual parser. Note that it relies heavily on the superclass
class PayloadQParser extends QParser {
  PayloadQueryParser pqParser;

  public PayloadQParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
    super(qstr, localParams, params, req);
  }

  // This is kind of tricky. The deal here is that you do NOT
  // want to get into all the process of parsing parentheses,
  // operators like AND/OR/NOT/+/- etc, it's difficult. So we'll
  // let the default parsing do all this for us.
  // Eventually the complex logic will resolve to asking for
  // fielded query, which we define in the PayloadQueryParser
  // below.
  @Override
  public Query parse() throws SyntaxError {
    String qstr = getString();
    if (qstr == null || qstr.length() == 0) return null;
    String defaultField = getParam(CommonParams.DF);
    if (defaultField == null) {
      defaultField = getReq().getSchema().getDefaultSearchFieldName();
    }
    pqParser = new PayloadQueryParser(this, defaultField);
    pqParser.setDefaultOperator(
        QueryParsing.getQueryParserDefaultOperator(getReq().getSchema(), getParam(QueryParsing.OP)));
    return pqParser.parse(qstr);
  }

  @Override
  public String[] getDefaultHighlightFields() {
    return pqParser == null ? new String[]{} : new String[] {pqParser.getDefaultField()};
  }
}

// Here's the tricky bit. You let the methods defined in the
// superclass do the heavy lifting, parsing all the
// parentheses/AND/OR/NOT/+/- whatever. Then, eventually, when
// all that's resolved down to a field and a term, and
// BOOM, you're here at the simple "getFieldQuery" call.
// NOTE: this is not suitable for phrase queries, the limitation
// here is that we're only evaluating payloads for
// queries that can resolve to combinations of single word
// fielded queries.
class PayloadQueryParser extends QueryParser {
  PayloadQueryParser(QParser parser, String defaultField) {
    super(parser.getReq().getCore().getSolrConfig().luceneMatchVersion, defaultField, parser);
  }

  @Override
  protected Query getFieldQuery(String field, String queryText, boolean quoted) throws SyntaxError {
    SchemaField sf = this.schema.getFieldOrNull(field);
    // Note that this will work for any field defined with the
    // <fieldType> of "payloads", not just the field "payloads".
    // One could easily parameterize this in the config files to
    // avoid hard-coding the values.
    if (sf != null && sf.getType().getTypeName().equalsIgnoreCase("payloads")) {
      return new PayloadTermQuery(new Term(field, queryText), new AveragePayloadFunction(), true);
    }
    return super.getFieldQuery(field, queryText, quoted);
  }
}

What’s with the AveragePayloadFunction()? Well, imagine that you have several terms in the same document each with different payloads. This function will “do the right thing” if the average of those values is “the right thing”. There are some pre-defined payload functions (all deriving from PayloadFunction) that “do the right thing” with the payloads in other cases, e.g. min, max that you can also use. Or, you could write your own if your needs are different.
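To make the difference between these functions concrete, here is a small stand-alone illustration of the arithmetic only. It is not Lucene's code: the class PayloadMath and its methods are hypothetical, and returning 1.0 when no payloads were seen is an assumption of this sketch. The real PayloadFunction classes work incrementally inside the scorer and their result is combined with the rest of the score, which is not shown here.

```java
final class PayloadMath {
  // Average of the payloads seen for one term in one document.
  // Falls back to 1.0 when there are none (assumption of this sketch).
  static float average(float[] payloads) {
    if (payloads.length == 0) return 1.0f;
    float sum = 0f;
    for (float p : payloads) sum += p;
    return sum / payloads.length;
  }

  // Largest payload seen, or 1.0 when there are none.
  static float max(float[] payloads) {
    if (payloads.length == 0) return 1.0f;
    float best = payloads[0];
    for (float p : payloads) best = Math.max(best, p);
    return best;
  }

  // Smallest payload seen, or 1.0 when there are none.
  static float min(float[] payloads) {
    if (payloads.length == 0) return 1.0f;
    float best = payloads[0];
    for (float p : payloads) best = Math.min(best, p);
    return best;
  }
}
```

For a document where the same term carries payloads 5.0, 1.0, and 3.0, averaging yields 3.0, max yields 5.0, and min yields 1.0. Which one is “the right thing” depends on whether a single strong signal or the overall tendency should drive the score.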

Now the PayloadSimilarityFactory (see the changes to schema.xml):

package payloadexample;

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.util.BytesRef;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.schema.SimilarityFactory;

public class PayloadSimilarityFactory extends SimilarityFactory {
  @Override
  public void init(SolrParams params) {
    super.init(params);
  }

  @Override
  public Similarity getSimilarity() {
    return new PayloadSimilarity();
  }
}

class PayloadSimilarity extends DefaultSimilarity {

  //Here's where we actually decode the payload and return it.
  @Override
  public float scorePayload(int doc, int start, int end, BytesRef payload) {
    if (payload == null) return 1.0F;
    return PayloadHelper.decodeFloat(payload.bytes, payload.offset);
  }
}

Remember, computers are stupid. Somewhere you have to tell the computer to grab the payload from the term and use it. That is all that you’re doing here. The default scorePayload just returns 1.0, so if you don’t do this step, you’ll be left wondering why your payloads have no effect at all.
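If it helps to see what “decode the payload” means at the byte level, here is a stand-alone sketch that round-trips a float through four bytes using java.nio. It is a stand-in for illustration, not Lucene's PayloadHelper: the class FloatPayload is hypothetical, and the big-endian byte order is an assumption of this sketch. The offset parameter matters because a BytesRef exposes a byte array plus an offset, so the payload is not guaranteed to start at index 0.

```java
import java.nio.ByteBuffer;

final class FloatPayload {
  // Packs a float into 4 bytes (ByteBuffer defaults to big-endian).
  static byte[] encode(float value) {
    return ByteBuffer.allocate(4).putFloat(value).array();
  }

  // Reads a float from 4 bytes starting at the given offset.
  static float decode(byte[] bytes, int offset) {
    return ByteBuffer.wrap(bytes, offset, 4).getFloat();
  }
}
```

Decoding from the wrong offset gives a different, usually meaningless, float rather than an error, which is why the similarity passes payload.offset along with payload.bytes.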

Use the new query parser in queries.

First, a brief review of what we’ve done so far:

  • Added payloads to the input file with the pipe (|) delimiter.
  • Changed the config files to put the payloads into the index.
  • Added custom code for a new similarity class and query parser to do something with the payload.
At this point, though, just like in the rest of Solr, using this information is a matter of having a query parser that calls on it. The payload query parser is just like any other query parser: edismax, standard, term, phrase, raw, nested, whatever (see the CWiki docs). It still must be called upon.

This is actually simple. The example here uses defType, but you could just as easily specify a defType in a request handler that uses this query parser in solrconfig.xml, use it in nested queries, etc. It’s a query parser that you can invoke like any of the ones mentioned in the link above.

http://localhost:8983/solr/collection1/query?defType=myqp&q=payloads:(electronics memory)
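The URL above contains a space, a colon, and parentheses, which a browser will usually encode for you but client code should encode itself. A minimal sketch, assuming Java 10 or later for the URLEncoder overload that takes a Charset; the class PayloadQueryUrl and its build method are hypothetical helpers, not part of Solr or SolrJ.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class PayloadQueryUrl {
  // Builds a request URL that selects a query parser via defType
  // and form-encodes the parameter values.
  static String build(String baseUrl, String parserName, String query) {
    return baseUrl
        + "?defType=" + URLEncoder.encode(parserName, StandardCharsets.UTF_8)
        + "&q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
  }
}
```

Calling build with the base URL from the example, the parser name myqp, and the query payloads:(electronics memory) produces the same request with the query value form-encoded.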

Conclusion

My hope is that this complete example will make it easier for others to connect all the pieces and use payloads from Solr without having to dig around too much. This particular code has some limitations, however:
  • It doesn’t handle phrases well. It’ll make a “TermQuery” of the entire phrase, which isn’t what you want.
  • It requires some code investment; it’d be nicer to have native support in Solr.
  • It doesn’t apply across all the different query parsers; it’s restricted to the newly-defined qparser.
    • I chatted with Chris Hostetter (he’s my go-to guy for all things query parser related), and this approach has this merit, quoting:
    • ...you can use that qparser with any field type -- a custom one that adds payloads, PreAnalyzedField, TextField using DelimitedPayloadTokenFilterFactory, whatever...
    • The other approach would be to make a custom FieldType. Its advantage (again quoting) is:
    • If you go the custom FieldType approach, then you automatically get payload based queries from most of the existing query parsers -- but you have to decide (and as a result: constrain) when/how/why payloads are added to your terms in the FieldType logic
So each has merit, it’s just a matter of implementing them in Solr sometime. Siiighhh. Along with the other 50 things I’d like to do.

Published at DZone with permission of Erick Erickson. See the original article here.
